# Preprocessing and Categorical Data Encoding

Hello there! In this lesson, we'll learn one of the approaches to modeling in data science: statistical modeling.

Here we will mostly use the `stats` module in the [SciPy library](https://docs.scipy.org/doc/scipy/reference/stats.html) to compute correlations and run statistical tests.

Jump ahead to the [Modeling Data: Statistical Modeling](#statistical-modeling) section to get right into the lesson.

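We'll use the Group 28 tweet data set, a collection of tweets about the claim that COVID-19 vaccines contain microchips, magnetic objects, and other substances, as our main data set.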

### Loading the data

In [32]:
# Load datasets
import pandas as pd
import gspread

sa = gspread.service_account(filename="group-28-dataset-7233caedfe09.json")
sheet = sa.open("Dataset - Group 28")
work_sheet = sheet.worksheet("Clean Data")

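# Read every cell of the worksheet; all values arrive as strings
df = pd.DataFrame(work_sheet.get_all_values())

# Promote the first row to the column header and drop it from the data
new_header = df.iloc[0]
df = df[1:]
df.columns = new_header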

df.head(2).style



0,ID,Timestamp,Tweet URL,Group,Topic,Keywords,Account handle,Account name,Account type,Tweet,Tweet Translated,Tweet Type,Date posted,Content type,Reasoning,Thread/Tweet Language
1,28-1,27/02/2023 13:34:22,https://twitter.com/ggnelhsa/status/1430131446719549442,28,"COVID-19 vaccines contain microchip, magnetic objects, and other substances","""vaccine"" ""kutsara""",@ggnelhsa,ᴀʙᴏ,Anonymous,Gagi astig ng vaccine ko astra yung akin eh dinikit ko yung kutsara sa pinasukan ng karayom aba dumikit HAHAHA! May magnet ata 😂,"My vaccine was so cool, mine was astra, I placed the spoon where the needle went in, but it stuck HAHAHA! It has a magnet 😂",Text,24/08/21 19:34,Emotional,"Claims that vaccine makes a person magnetic. As checked by Vera Files, vaccines do not make people magnetic. https://verafiles.org/articles/vera-files-fact-check-covid-19-vaccines-do-not-have-magneto",Filipino
2,28-2,27/02/2023 13:40:42,https://twitter.com/Jheysi0208/status/1428597590916952067,28,"COVID-19 vaccines contain microchip, magnetic objects, and other substances","""vaccine"" ""spoon""",@Jheysi0208,Jheysi | 제이시 | ジェイシー,Anonymous,1st dose done!! {Image: picture of thei vaccine} Normal naman na sumakit yung arm na hindi mo maitaas ng husto noh? Hahahaha naprapraning akooo :))) Arm is ok now! Di na siya masakit! Pero tinry ko dikitan ng spoon and dumikit siyaaaaaa hhahahahahaha ommgg,"1st dose done!! {Image: picture of thei vaccine} It's normal for the arm to hurt when I raise it, right? Hahahaha I'm so paranoid :))) Arm is ok now! It doesn't hurt anymore! But when I try to place a spoon, it sticks hahahahahaha ommgg","Text, Image",20/08/21 13:59,Emotional,"Claims that vaccine makes a person magnetic. As checked by Vera Files, vaccines do not make people magnetic. https://verafiles.org/articles/vera-files-fact-check-covid-19-vaccines-do-not-have-magneto",Filipino


Let's now explore the data. There's no single right way of doing this, because every data set has its own characteristics. But here are some standard steps we can take when exploring the data.

### Checking for Missing Values

First, let's check how the data is organized and whether any values are missing.

The data set has 151 rows and 16 columns, and `isnull()` reports no missing values in any of them.

In [33]:
print("Check Missing Values:")
df_copy = df.copy(deep=True) 
df_copy.isnull().sum()

Check Missing Values:


0
ID                       0
Timestamp                0
Tweet URL                0
Group                    0
Topic                    0
Keywords                 0
Account handle           0
Account name             0
Account type             0
Tweet                    0
Tweet Translated         0
Tweet Type               0
Date posted              0
Content type             0
Reasoning                0
Thread/Tweet Language    0
dtype: int64
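
One caveat is worth checking here: `get_all_values()` returns every cell as a string, so a blank cell in the sheet arrives as an empty string rather than `NaN`, and `isnull()` does not count it. Below is a minimal sketch of a blank-aware check. It uses a small made-up frame in place of the real sheet; the rows and the `count_missing` helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sheet: blank cells come through as "" or "  "
sample = pd.DataFrame(
    {
        "Account type": ["Anonymous", "", "Anonymous"],
        "Content type": ["Emotional", "Emotional", "  "],
    }
)


def count_missing(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, treating blank strings as missing."""
    # Replace empty or whitespace-only strings with NaN before counting
    cleaned = frame.replace(r"^\s*$", np.nan, regex=True)
    return cleaned.isnull().sum()


naive_counts = sample.isnull().sum()        # blank strings are not counted
blank_aware_counts = count_missing(sample)  # blank strings are counted
print(naive_counts)
print(blank_aware_counts)
```

If `count_missing(df)` also returns zeros for the real data, the "no missing values" result above can be trusted.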

In [35]:
print("Dataset summary:")
df.describe().style

Dataset summary:


0,ID,Timestamp,Tweet URL,Group,Topic,Keywords,Account handle,Account name,Account type,Tweet,Tweet Translated,Tweet Type,Date posted,Content type,Reasoning,Thread/Tweet Language
count,151,151,151,151,151,151,151,151,151,151,151,151,151,151,151,151
unique,151,132,151,1,1,21,140,138,4,151,151,10,148,4,119,3
top,28-1,1/03/2023 02:28:22,https://twitter.com/ggnelhsa/status/1430131446719549442,28,"COVID-19 vaccines contain microchip, magnetic objects, and other substances","""covid"" ""vaccine"" ""magnet"" ""implant"" ""kutsara"" ""spoon"" ""barya"" ""coin"" ""microchip"" ""metal"" ""track"" ""robot""",@ven_cuenca,Ven cuenca,Anonymous,Gagi astig ng vaccine ko astra yung akin eh dinikit ko yung kutsara sa pinasukan ng karayom aba dumikit HAHAHA! May magnet ata 😂,"My vaccine was so cool, mine was astra, I placed the spoon where the needle went in, but it stuck HAHAHA! It has a magnet 😂","Text, Reply",08/12/20 19:14,Emotional,Claims that SinoVac only contains water. This is false as shown in the product information. https://www.moh.gov.sg/docs/librariesprovider5/vaccination-matter/annex-2---sinovac-vaccination-information-sheet-170621.pdf,Foreign
freq,1,7,1,151,151,41,4,4,110,1,1,67,2,82,13,78


> *What type/s of data are present?*

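* Categorical (Qualitative): `Group`, `Topic`, `Keywords`, `Account type`, `Tweet Type`, `Content type`, `Thread/Tweet Language`
* Text and identifiers (Qualitative): `ID`, `Tweet URL`, `Account handle`, `Account name`, `Tweet`, `Tweet Translated`, `Reasoning`
* Date and time, stored as strings: `Timestamp`, `Date posted`
* Numerical (Quantitative): none; every column is loaded from the sheet as a string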
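
Categorical columns in this data, such as `Account type`, `Content type`, and `Tweet Type`, have to be turned into numbers before most statistical tests or models can use them. The sketch below shows one possible encoding, not necessarily the one used later in the lesson. The sample rows are hypothetical: `Identified` and `Rational` are made-up category values, while the other values appear in the outputs above.

```python
import pandas as pd

# Hypothetical rows; "Identified" and "Rational" are made up for illustration
sample = pd.DataFrame(
    {
        "Account type": ["Anonymous", "Identified", "Anonymous"],
        "Content type": ["Emotional", "Rational", "Emotional"],
        "Tweet Type": ["Text", "Text, Image", "Text, Reply"],
    }
)

# One-hot encode the columns that hold a single label per row
one_hot = pd.get_dummies(sample[["Account type", "Content type"]], dtype=int)

# "Tweet Type" can hold several labels per row, so split on ", " and
# create one indicator column per label
tweet_type = sample["Tweet Type"].str.get_dummies(sep=", ").add_prefix("Tweet Type_")

encoded = pd.concat([one_hot, tweet_type], axis=1)
print(encoded)
```

Plain one-hot encoding would treat `"Text, Image"` as its own category, which is why the multi-label column is split first.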

### Ensuring Formatting Consistency

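Next, let's make sure each column is stored in a consistent format.

Because every value was read from the sheet as a string, the date columns need attention first: `Timestamp` uses a four-digit year with seconds (`27/02/2023 13:34:22`), while `Date posted` uses a two-digit year without seconds (`24/08/21 19:34`).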
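
As an illustration of a formatting fix, the two date columns can be parsed into proper datetime values. This is a minimal sketch using values copied from the outputs above; the format strings are assumptions (day first, then month) inferred from those values, so they should be checked against the full sheet before use.

```python
import pandas as pd

# Values copied from the outputs shown earlier in this notebook
sample = pd.DataFrame(
    {
        "Timestamp": ["27/02/2023 13:34:22", "1/03/2023 02:28:22"],
        "Date posted": ["24/08/21 19:34", "20/08/21 13:59"],
    }
)

# Assumed formats: day/month/four-digit year with seconds,
# and day/month/two-digit year without seconds
sample["Timestamp"] = pd.to_datetime(sample["Timestamp"], format="%d/%m/%Y %H:%M:%S")
sample["Date posted"] = pd.to_datetime(sample["Date posted"], format="%d/%m/%y %H:%M")

print(sample.dtypes)
```

Passing an explicit `format` makes a malformed value raise an error instead of being silently misread as month first.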

### Visualizing data relationships

Another great way to explore the data is to visualize the features. 

Let's visualize how the tweets are distributed across the categorical features, such as `Account type` and `Content type`.
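
One way to look at how two categorical columns relate is a cross-tabulation drawn as a grouped bar chart. The sketch below assumes `matplotlib` is installed and uses hypothetical rows (`Identified` and `Rational` are made-up category values); on the real data, `sample` would be replaced by `df`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical rows; "Identified" and "Rational" are made-up category values
sample = pd.DataFrame(
    {
        "Account type": ["Anonymous", "Anonymous", "Identified", "Anonymous"],
        "Content type": ["Emotional", "Rational", "Emotional", "Emotional"],
    }
)

# Cross-tabulate: rows are account types, columns are content types
counts = pd.crosstab(sample["Account type"], sample["Content type"])

# Grouped bar chart: one group per account type, one bar per content type
ax = counts.plot(kind="bar", rot=0)
ax.set_xlabel("Account type")
ax.set_ylabel("Number of tweets")
ax.set_title("Content type by account type")
plt.tight_layout()
```

In a notebook the figure renders inline after the cell runs; in a script, call `plt.show()` at the end.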