# Day 1 - Afternoon

# 1. Data Cleaning and Data Encoding

## 1.1 Introduction to Data Cleaning

Data cleaning, also known as data cleansing or data scrubbing, refers to the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. It is an essential step in data preprocessing and analysis to ensure that the data is accurate, reliable, and suitable for further analysis or modeling.

Data cleaning is important because real-world data is often imperfect. It can contain various issues such as missing values, duplicate records, incorrect formatting, inconsistent spellings, outliers, and more. These problems can arise due to human errors during data entry, technical glitches, or the integration of data from different sources.

The primary objectives of data cleaning are as follows:

1. **Removing or correcting errors:** Data cleaning involves identifying and addressing errors in the dataset. For example, it may involve fixing typos, resolving inconsistent date formats, or rectifying inaccurate numerical entries.


2. **Handling missing data:** Missing data refers to the absence of values in certain records or attributes. Data cleaning techniques help in dealing with missing data, which may involve imputing missing values based on statistical methods or removing records with excessive missing data.


3. **Handling duplicates:** Duplicates are identical or near-identical records that exist within a dataset. Data cleaning aims to identify and remove or merge duplicate records, ensuring that each unique entity is represented only once.


4. **Standardizing and transforming data:** Inconsistent formatting, units, or scales can hinder data analysis. Data cleaning involves standardizing variables, converting units, and transforming data to ensure consistency and compatibility across the dataset.


5. **Handling outliers:** Outliers are extreme values that deviate significantly from the typical pattern of the data. Data cleaning techniques help in identifying and dealing with outliers, which may involve removing them if they are due to data entry errors or handling them separately if they represent important observations.
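As a rough illustration of objectives 3 and 4, the sketch below standardizes text and units in a small, made-up DataFrame and then removes the duplicates that standardization exposes. The data, the column names, and the variable names are assumptions chosen for the example, not part of the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data: inconsistent spellings and repeated records
people = pd.DataFrame({
    "name": ["Ann Lee", "ann lee ", "Bob Ray", "Bob Ray"],
    "city": ["Paris", "paris", "Oslo", "Oslo"],
    "height_cm": [170.0, 170.0, 182.0, 182.0],
})

# Standardize text: trim whitespace and unify capitalization
for col in ["name", "city"]:
    people[col] = people[col].str.strip().str.title()

# Convert units so every height uses the same scale (metres)
people["height_m"] = people["height_cm"] / 100
people = people.drop(columns="height_cm")

# After standardizing, near-duplicates become exact duplicates
n_duplicates = int(people.duplicated().sum())
deduplicated = people.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduplicated)
```

Standardizing before deduplicating matters here: `"ann lee "` and `"Ann Lee"` only count as the same record once whitespace and capitalization agree.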


Data cleaning is typically performed using a combination of manual and automated techniques. It requires domain knowledge, data exploration, and the use of various data cleaning tools and algorithms.
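One example of an automated check is flagging outliers with the interquartile range (IQR). The sketch below uses the common 1.5 × IQR rule of thumb on made-up fare values; both the threshold and the data are assumptions for illustration, and flagged values still need manual review before anything is removed.

```python
import pandas as pd

# Hypothetical fares; the last value is far above the rest
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.0], name="fare")

# Interquartile range: spread of the middle 50% of the data
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (fares < lower) | (fares > upper)
outliers = fares[is_outlier]

print(outliers)
```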

By performing effective data cleaning, analysts and data scientists can improve the quality of the data and enhance the accuracy and reliability of their subsequent analyses, predictive models, or decision-making processes.

## 1.2 Preparation

### 1.2.1 Import Libraries

In [36]:
import pandas as pd
import numpy as np
import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

In [45]:
!git clone https://github.com/MLcmore2023/MLcmore2023.git

Cloning into 'MLcmore2023'...
remote: Enumerating objects: 610, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (126/126), done.
remote: Total 610 (delta 56), reused 167 (delta 40), pack-reused 419
Receiving objects: 100% (610/610), 99.74 MiB | 28.23 MiB/s, done.
Resolving deltas: 100% (235/235), done.
Updating files: 100% (172/172), done.


In [48]:
!mv ./MLcmore2023/'day1_pm_afternoon'/* ./MLcmore2023/'day1_pm_afternoon'/.* ./

In [37]:
import seaborn as sns

# Load the Titanic dataset from the CSV file in the cloned repository
titanic = pd.read_csv('titanic.csv')

In [38]:
# display the dataframe
titanic.head()

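| | passenger_id | pclass | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1216 | 3 | Smyth, Miss. Julia | female | NaN | 0 | 0 | 335432 | 7.7333 | NaN | Q | 13.0 | NaN | NaN | 1 |
| 1 | 699 | 3 | Cacic, Mr. Luka | male | 38.0 | 0 | 0 | 315089 | 8.6625 | NaN | S | NaN | NaN | Croatia | 0 |
| 2 | 1267 | 3 | Van Impe, Mrs. Jean Baptiste (Rosalie Paula Go... | female | 30.0 | 1 | 1 | 345773 | 24.15 | NaN | S | NaN | NaN | NaN | 0 |
| 3 | 449 | 2 | Hocking, Mrs. Elizabeth (Eliza Needs) | female | 54.0 | 1 | 3 | 29105 | 23.0 | NaN | S | 4.0 | NaN | Cornwall / Akron, OH | 1 |
| 4 | 576 | 2 | Veal, Mr. James | male | 40.0 | 0 | 0 | 28221 | 13.0 | NaN | S | NaN | NaN | Barre, Co Washington, VT | 0 |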


### 1.2.2 Explore the dataset

In [40]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 850 entries, 0 to 849
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   passenger_id  850 non-null    int64  
 1   pclass        850 non-null    int64  
 2   name          850 non-null    object 
 3   sex           850 non-null    object 
 4   age           676 non-null    float64
 5   sibsp         850 non-null    int64  
 6   parch         850 non-null    int64  
 7   ticket        850 non-null    object 
 8   fare          849 non-null    float64
 9   cabin         191 non-null    object 
 10  embarked      849 non-null    object 
 11  boat          308 non-null    object 
 12  body          73 non-null     float64
 13  home.dest     464 non-null    object 
 14  survived      850 non-null    int64  
dtypes: float64(3), int64(5), object(7)
memory usage: 99.7+ KB


The Titanic dataset is a historical dataset that contains information about the passengers aboard the RMS Titanic, which was a British passenger liner that sank on its maiden voyage in April 1912 after colliding with an iceberg. The dataset has been made available for public use and is commonly used as a learning resource for data analysis, data visualization, and machine learning tasks.

The dataset provides a glimpse into the demographics and circumstances surrounding the passengers on the Titanic. It is often used to explore the factors that influenced survival rates and to build predictive models to determine the likelihood of a passenger surviving based on various features.

The columns (features) in the dataset are as follows:

1. `passenger_id`: An identifier for each passenger.
2. `pclass`: The ticket class of the passenger (1 = 1st, 2 = 2nd, 3 = 3rd class).
3. `name`: The name of the passenger.
4. `sex`: The gender of the passenger (male or female).
5. `age`: The age of the passenger in years.
6. `sibsp`: The number of siblings or spouses aboard the Titanic with the passenger.
7. `parch`: The number of parents or children aboard the Titanic with the passenger.
8. `ticket`: The ticket number.
9. `fare`: The passenger's fare or ticket price.
10. `cabin`: The cabin number of the passenger.
11. `embarked`: The port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
12. `boat`: The lifeboat number if the passenger survived and was rescued.
13. `body`: The body number if the passenger did not survive and their body was recovered.
14. `home.dest`: The home or destination of the passenger.
15. `survived`: The target variable, indicating whether the passenger survived (1) or did not survive (0).
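Beyond `info()`, a few one-line summaries help when exploring these columns. The sketch below rebuilds a small sample from the five rows displayed by `titanic.head()` above (only a subset of the columns is kept) so that it runs on its own; with the full dataset loaded, the same calls can be applied to `titanic` directly. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Five-row sample copied from the head() output, restricted to a few columns
sample = pd.DataFrame({
    "pclass": [3, 3, 3, 2, 2],
    "sex": ["female", "male", "female", "female", "male"],
    "age": [np.nan, 38.0, 30.0, 54.0, 40.0],
    "fare": [7.7333, 8.6625, 24.15, 23.0, 13.0],
    "survived": [1, 0, 0, 1, 0],
})

# Summary statistics (count, mean, quartiles, ...) for numeric columns
print(sample[["age", "fare"]].describe())

# Frequency of each category in a text column
sex_counts = sample["sex"].value_counts()
print(sex_counts)

# Survival rate per ticket class: the mean of a 0/1 column is a proportion
survival_by_class = sample.groupby("pclass")["survived"].mean()
print(survival_by_class)
```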


## 1.3 Data Cleaning

### 1.3.1 Removing or correcting errors
Errors such as typos, inconsistent formats, or inaccurate entries can often be fixed by replacing the offending values with corrected ones.


The general pattern for replacing an incorrect value in a text column is:

df['column_name'] = df['column_name'].str.replace('incorrect_value', 'correct_value')
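One caveat with `str.replace` is that it matches substrings, so it can change more than intended. The sketch below, on a small made-up Series, contrasts it with `Series.replace`, which by default only changes cells whose entire value matches.

```python
import pandas as pd

sex = pd.Series(["female", "male", "female"])

# Substring replacement: "male" also matches inside "female"
substring_result = sex.str.replace("male", "M", regex=False)
print(substring_result.tolist())  # ['feM', 'M', 'feM']

# Whole-value replacement: only cells equal to a dictionary key are changed
whole_value_result = sex.replace({"male": "M", "female": "F"})
print(whole_value_result.tolist())  # ['F', 'M', 'F']
```

For recoding categories, whole-value replacement is usually the safer choice; `str.replace` is better suited to fixing fragments inside longer strings.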

### Example

As an illustration, recode the values in the "sex" column to single letters:

In [41]:
titanic.sex[0:5]

0    female
1      male
2    female
3    female
4      male
Name: sex, dtype: object

In [42]:
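# male -> M
# female -> F
# Map each old value to its replacement and assign the result back to the column
titanic['sex'] = titanic['sex'].replace({'male': 'M', 'female': 'F'})

Assigning the result back to `titanic['sex']` updates the original DataFrame. This is more reliable than calling `replace(..., inplace=True)` on a selected column, which in recent pandas versions may operate on a temporary copy and leave the DataFrame unchanged.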

In [43]:
titanic.sex[0:5]

0    F
1    M
2    F
3    F
4    M
Name: sex, dtype: object

### 1.3.2 Handling missing data

Missing data can be handled either by removing the affected records or by imputing the missing values. The first step is always to check for missing values.

In [44]:
print(titanic.isnull().sum())

passenger_id      0
pclass            0
name              0
sex               0
age             174
sibsp             0
parch             0
ticket            0
fare              1
cabin           659
embarked          1
boat            542
body            777
home.dest       386
survived          0
dtype: int64
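Raw counts are easier to judge as percentages. In the output above, for instance, `cabin` is missing in 659 of 850 rows (about 77.5%), while `age` is missing in 174 (about 20.5%). The sketch below computes such percentages on a small made-up DataFrame; with the dataset loaded, `titanic.isnull().mean() * 100` gives the same summary for the real columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "fare": [7.25, 71.28, 8.05, 53.1],
})

# isnull() yields booleans; their mean is the fraction missing per column
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```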


To fix missing data in a column, you can use various techniques depending on the nature of the missing values. Here are a few common approaches:

### 1.3.3 Removing missing values

- If the missing values are relatively few and randomly distributed, you may choose to remove the rows or columns with missing values.
- Use the **dropna()** method in pandas to drop rows or columns with missing values. For example: **df.dropna()**.
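The behaviour of `dropna()` depends heavily on its arguments. The following sketch, on a made-up DataFrame, compares dropping any incomplete row, dropping rows based on one column only, and dropping sparse columns. The data and the threshold of 2 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame in which every row has at least one missing value
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "embarked": ["S", "C", np.nan, "S"],
})

# Drop rows with a missing value in any column (here: every row)
complete_rows = toy.dropna()

# Drop rows only when a specific column is missing
has_embarked = toy.dropna(subset=["embarked"])

# Drop columns that have fewer than 2 non-missing values
dense_columns = toy.dropna(axis=1, thresh=2)

print(len(complete_rows), len(has_embarked), list(dense_columns.columns))
```

Because a plain `dropna()` can discard far more data than expected, restricting it with `subset` or `thresh` is often the safer option.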

### Example
Drop the row with a NaN value in the "embarked" column:

In [27]:
titanic.dropna(subset=['embarked'], inplace=True)

In [28]:
# The row with a missing value in the embarked column has now been dropped
print(titanic.isnull().sum())

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64

Note that the output above lists columns such as `class`, `deck`, and `alone`, which belong to seaborn's built-in copy of the Titanic dataset rather than to `titanic.csv`. It appears to have been captured from that version, so running the cell on the CSV will instead list the columns reported by `titanic.info()` earlier, with different counts. The outputs in the imputation example that follows appear to come from the same version.


### 1.3.4 Imputing missing values

- If the missing values follow a certain pattern or have a relationship with other variables, you can fill them in with estimated or imputed values.
- Use the **fillna()** method in pandas to fill missing values with a specific value, mean, median, or any other desired imputation method. For example: **df['column_name'].fillna(value)**.
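The choice of fill value usually depends on the column type. As a minimal sketch on made-up data, a numeric column can be filled with its median and a categorical column with its most frequent value (the mode). These are common defaults, not the only reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 60.0],
    "embarked": ["S", np.nan, "S", "C"],
})

# Numeric column: the median is less sensitive to extreme values than the mean
age_median = toy["age"].median()
toy["age"] = toy["age"].fillna(age_median)

# Categorical column: mode() returns a Series, so take its first entry
most_common_port = toy["embarked"].mode()[0]
toy["embarked"] = toy["embarked"].fillna(most_common_port)

print(toy)
```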

### Example
Filling missing values in the "age" column with the median age:


In [29]:
m = titanic['age'].median()
m

28.0

In [30]:
# Fill the missing ages with the median computed above
titanic['age'] = titanic['age'].fillna(m)

In [31]:
print(titanic.isnull().sum())

survived         0
pclass           0
sex              0
age              0
sibsp            0
parch            0
fare             0
embarked         0
class            0
who              0
adult_male       0
deck           688
embark_town      0
alive            0
alone            0
dtype: int64
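
When missing values are related to another variable, a single global median can be replaced by a per-group one. The sketch below is an assumed variation on the example above, using made-up data: it fills each missing age with the median age of the passenger's ticket class.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: ages differ noticeably between ticket classes
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 3, 3, 3],
    "age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# transform("median") returns the group median aligned to every original row
class_median_age = toy.groupby("pclass")["age"].transform("median")

# fillna aligns on the index, so each gap gets its own class median
toy["age"] = toy["age"].fillna(class_median_age)

print(toy)
```

Whether grouping by `pclass` is appropriate for the real dataset is a modelling decision; the point is only that `fillna` accepts a Series as well as a single value.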
