# Data cleaning

Data cleaning is a crucial step in the data analysis process, which involves identifying and correcting errors, inconsistencies, and discrepancies in the data. The main steps involved in data cleaning are as follows:

- **Identify missing values**: The first step in data cleaning is to identify any missing values in the data. Missing values can arise due to a variety of reasons, such as incomplete data collection, errors in data entry, or data corruption. Identifying and handling missing values is important as they can affect the results of the analysis.

- **Remove duplicates**: Duplicate data points can occur due to errors in data collection or data entry. Duplicate data can distort the analysis and lead to incorrect results. Identifying and removing duplicates is an important step in data cleaning.

- **Correct inconsistent data**: Inconsistent data can arise due to errors in data entry, data corruption, or data integration from multiple sources. Inconsistent data can include spelling errors, numerical errors, or discrepancies in the format of data. Correcting inconsistent data involves identifying the errors and making the necessary corrections.

- **Standardize data**: Standardizing data involves converting data into a consistent format, for example converting all dates into one date format or all text to lowercase. A consistent format makes values easier to compare and analyze.

- **Handle outliers**: Outliers are data points that are significantly different from the other data points. Outliers can arise due to errors in data collection, data corruption, or genuine differences in the data. Handling outliers involves identifying the outliers and deciding how to handle them. Outliers can be removed or treated differently in the analysis.

- **Validate data**: Validating data involves checking the data for accuracy and consistency. This involves checking the data against known values or sources of information to ensure that it is accurate and consistent.

Overall, data cleaning is an iterative process that involves identifying and correcting errors in the data until the data is clean and ready for analysis.
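
The steps above can be sketched on a small table. This is a minimal illustration, assuming a hypothetical toy DataFrame (the names `raw`, `clean`, and `age_valid`, and the plausible age range of 0 to 120, are made up for this example and are not part of the Titanic data).

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file
raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "cara", "Dan"],
    "city": ["Oslo", "oslo ", "oslo ", "Bergen", None],
    "age": [34, 29, 29, 41, 230],
})

# Identify missing values per column
missing_per_column = raw.isna().sum()

# Remove exact duplicate rows
clean = raw.drop_duplicates().copy()

# Correct and standardize text: trim whitespace, unify letter case
clean["city"] = clean["city"].str.strip().str.lower()
clean["name"] = clean["name"].str.strip().str.title()

# Validate against an assumed plausible range and flag suspicious values
clean["age_valid"] = clean["age"].between(0, 120)

print(missing_per_column)
print(clean)
```

Flagging the implausible age instead of deleting it keeps the decision about how to handle the outlier separate from detecting it.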

### Titanic

The Titanic dataset is a famous dataset used in the field of data science and machine learning. It contains information about the passengers who were aboard the RMS Titanic when it sank on its maiden voyage in April 1912. The dataset is often used as an introductory dataset for learning data analysis and machine learning algorithms.

The Titanic dataset contains the following information for each passenger:

- PassengerId: A unique identifier for each passenger
- Survived: A binary variable indicating whether the passenger survived (1) or not (0)
- Pclass: The passenger's class (1st, 2nd, or 3rd)
- Name: The passenger's name
- Sex: The passenger's gender
- Age: The passenger's age
- SibSp: The number of siblings or spouses the passenger had aboard the Titanic
- Parch: The number of parents or children the passenger had aboard the Titanic
- Ticket: The passenger's ticket number
- Fare: The fare paid by the passenger
- Cabin: The passenger's cabin number
- Embarked: The port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)

The original dataset contains 891 rows, one per passenger in the dataset. The file used here, `titanic_with_dub.csv`, has 894 rows because three rows have been duplicated. The Survived column is the target variable, and the other columns are used as features for predicting whether a passenger survived or not. The dataset is often used to build predictive models to determine which passengers were more likely to survive the sinking of the Titanic based on their characteristics.

In [None]:
import pandas as pd

titanic = pd.read_csv('data/titanic_with_dub.csv')
titanic.head()

#### Missing values

Dealing with missing values: The Titanic dataset has missing values in the Age, Cabin, and Embarked columns. One approach to handling missing values is to impute the missing values with mean or median values. Alternatively, rows with missing values can be dropped from the dataset.

Imputation of missing values can potentially cause a lot of bias, and you should be careful thinking about the underlying reasons for data being missing and the potential bias caused by imputation.
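
As a rough illustration of the trade-off, the sketch below compares dropping rows with median imputation on a hypothetical toy table (the values and the `age_was_missing` column name are assumptions for this example). Keeping a flag for imputed rows makes it possible to check later whether imputation influenced a result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "fare": [7.25, 8.05, 53.1, 13.0, 26.55],
})

# Option 1: drop rows with a missing age (loses two of five rows)
dropped = df.dropna(subset=["age"])

# Option 2: impute with the median and keep a flag for imputed rows
imputed = df.copy()
imputed["age_was_missing"] = imputed["age"].isna()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())

# Filling every gap with the same value shrinks the spread of the column
print(df["age"].std(), imputed["age"].std())
```

The smaller standard deviation after imputation is one concrete form of the bias mentioned above: the column looks less variable than the observed data suggests.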

You can read more in this reference: [García, S., Luengo, J., & Herrera, F. (2015). Data preprocessing in data mining.](https://d1wqtxts1xzle7.cloudfront.net/60477900/Garcia__Luengo__Herrera-Data_Preprocessing_in_Data_Mining_-_Springer_International_Publishing_201520190903-77973-th1o73-libre.pdf?1567544443=&response-content-disposition=inline%3B+filename%3DIntelligent_Systems_Reference_Library_72.pdf&Expires=1679794027&Signature=Tbl8YhiUQworlYTbuS6GmJdj94mgY2vfpY86Tk7cVQEgk4qXV9~bjXxEjJWZgYxGEp724F2KkJU-WM9euX46J0d-6OlQBekLA8o7GcJ0SUNoXrE2gNzbr5SExsKeMqAYfBtmZVzlwkWTgL7WCha7lXhtPJmnmTMYl0wRiV1QA4MuAZUN-lliWU9SKdut48~KCDXRQ-sybHdakWoEL7Q1nq4JTXxreu~eMs996UJqylo0dftBtab6AGENHCw3FKUSi6CnekNrOV6fGISRIS1vcZaZdeZlfr5ywHjaQIGvobWS0--k6KtS9wVvl-28RZKbzrp2AbAqw2slXmrE-ADjbA__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA)

In [None]:
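# Append copies of three rows and save the result as the file with duplicates.
# Note: `titanic` was read from this same file, so re-running this cell
# adds three more duplicated rows each time.
x = titanic.iloc[[40, 358, 789]]
titanicdub = pd.concat([titanic, x])
titanicdub.info()
titanicdub.to_csv('data/titanic_with_dub.csv', index=False)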

In [None]:
titanic.info()

In [None]:
titanic.isna().any()

In [None]:
titanic[titanic.Embarked.isna()]

In [None]:
titanic[titanic.Cabin.isna()]

In [None]:
pd.crosstab(titanic.Pclass, titanic.Cabin.isna())

In [None]:
titanic[titanic.Age.isna()]

In [None]:
titanic.groupby('Pclass').Age.describe()

In [None]:
import matplotlib.pyplot as plt
plt.scatter(titanic['Fare'], titanic['Age'], c=titanic['Pclass'])  # color by passenger class
plt.xlabel('Fare')
plt.ylabel('Age')

We will impute missing Age using linear regression.

Remember to add scikit-learn to your environment: `poetry add scikit-learn`

In [None]:
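from sklearn.linear_model import LinearRegression

lr = LinearRegression()
columns = ['Pclass']  # predictor(s) used to estimate Age

# Rows with a missing Age get predictions; rows with a known Age train the model.
# .copy() avoids pandas' SettingWithCopyWarning when Age is assigned below.
testdf = titanic.loc[titanic['Age'].isna()].copy()
traindf = titanic.loc[titanic['Age'].notna()]

lr.fit(traindf[columns], traindf['Age'])

# Fill the missing ages with the model's predictions
testdf['Age'] = lr.predict(testdf[columns])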

In [None]:
titanic = pd.concat([testdf, traindf], ignore_index=True)

In [None]:
titanic.info()

#### Removing duplicates
Dealing with duplicates: Check for duplicates in the dataset, which can arise due to data entry errors or data collection methods. Duplicates can be dropped or the data can be aggregated to remove duplicate values.

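The original Titanic data contains no duplicates, but the file loaded here has 894 rows instead of 891 because duplicated rows were added to it.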
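
The two options, dropping and aggregating, can be sketched as follows. The toy table and the column names `passenger` and `fare` are hypothetical and only serve the illustration; taking the mean is one possible aggregation, not the only sensible one.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
df = pd.DataFrame({
    "passenger": ["A", "B", "B", "C", "C"],
    "fare": [10.0, 20.0, 20.0, 30.0, 34.0],
})

# Exact duplicates: every column matches an earlier row
n_exact = df.duplicated().sum()
deduped = df.drop_duplicates()

# Same key but conflicting values: aggregate instead of dropping
aggregated = deduped.groupby("passenger", as_index=False)["fare"].mean()

print(n_exact)
print(aggregated)
```

`drop_duplicates` only removes rows that match in every column, so the two conflicting rows for passenger C survive it and need an explicit decision.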

In [None]:
titanic[titanic.duplicated()]

In [None]:
# Drop duplicate rows
titanic.drop_duplicates(inplace=True)

#### Correct inconsistent data

In [None]:
titanic.info()

In [None]:
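# Convert Sex column to a binary variable (1 = male, 0 = female).
# astype(int) is needed because recent pandas versions return booleans from get_dummies.
titanic["Sex"] = pd.get_dummies(titanic["Sex"]).male.astype(int)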

In [None]:
titanic.head()

In [None]:
titanic.describe()

#### Handle outliers

In [None]:
import matplotlib.pyplot as plt # pip install matplotlib

# Create boxplots to visualize outliers in numerical variables
fig, axs = plt.subplots(1, 2, figsize=(10,5))
axs[0].boxplot(titanic["Age"])
axs[0].set_title("Boxplot of Age")
axs[1].boxplot(titanic["Fare"])
axs[1].set_title("Boxplot of Fare")
plt.show()


*A boxplot from the Matplotlib library shows the median value as a horizontal line inside a box that represents the interquartile range (IQR), with the lower and upper whiskers indicating the lowest and highest non-outlier values within 1.5 times the IQR of the lower and upper quartile, respectively. Outliers are displayed as individual points outside the whiskers.*
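
The 1.5 × IQR rule described above can also be computed directly. This is a minimal sketch on hypothetical values (not taken from the Titanic file); the names `lower_fence` and `upper_fence` are chosen for this example.

```python
import pandas as pd

# Hypothetical values, not taken from the Titanic file
values = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points beyond these limits are drawn as individual outliers in the boxplot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())
```

The whiskers themselves end at the most extreme data points that still lie inside the two limits, so they are usually shorter than 1.5 × IQR.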

In [None]:
# Calculate z-scores to identify outliers in numerical variables
from scipy import stats
z_scores_age = stats.zscore(titanic["Age"])
z_scores_fare = stats.zscore(titanic["Fare"])
threshold = 3
outliers_age = titanic["Age"][abs(z_scores_age) > threshold]
outliers_fare = titanic["Fare"][abs(z_scores_fare) > threshold]
print("Outliers in Age:", outliers_age)
print("Outliers in Fare:", outliers_fare)

#### Standardize data

In [None]:
# Scale numerical variables: MaxAbsScaler divides each column by its maximum absolute value
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
titanic[["Age", "Fare"]] = scaler.fit_transform(titanic[["Age", "Fare"]])
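
To make the difference between the two scikit-learn scalers concrete, here is a small comparison on hypothetical values (the array `X` is made up for this example). It is meant as an illustration of what each scaler computes, not as a recommendation for either one.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical values, not taken from the Titanic file
X = np.array([[20.0], [30.0], [40.0], [50.0]])

# MaxAbsScaler divides each column by its largest absolute value
max_abs = MaxAbsScaler().fit_transform(X)

# StandardScaler subtracts the column mean and divides by the standard deviation
standardized = StandardScaler().fit_transform(X)

print(max_abs.ravel())
print(standardized.ravel())
```

`MaxAbsScaler` keeps zeros at zero and does not shift the data, while `StandardScaler` centers each column at mean 0 with unit variance.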

return to [overview](../00_overview.ipynb)