## Author: Dere, Abdulhameed Abiola

## DATA CLEANING

To build a machine learning model or work as a data scientist, it is important to understand how to clean a dataset. In the vast majority of cases the data will be "messy", so this notebook explores the common cleaning methods that you should be conversant and comfortable with after this session.

### For this session, we will make use of two datasets: Loan Prediction and Big Mart Sales Prediction.

In [1]:
# import the required libraries

import pandas as pd
import numpy as np

In [2]:
# Read in the required dataset

data = pd.read_csv('Loan prediction train.csv')

data.head()

Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y


In [3]:
df = pd.read_csv('Big Mart Sales Prediction Train.csv')

df.head()
# df.info()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [4]:
# Check the number of rows and columns
# shape is an attribute, not a method, so it is accessed without parentheses
data.shape

In [None]:
# Check additional information about the dataframe
data.info()

## DESCRIBING/SORTING DATA

The describe function computes summary statistics for the dataframe. By default it covers only the numeric (integer and float) columns and skips the object columns, i.e. string values. To summarise the string columns, pass `include = 'object'`.

In [None]:
# data.describe() method gives an idea of the nature of the data that we are working with.
# It also makes it easy to draw out inferences from the dataset
# This also gives an idea of the presence of an outlier or not, which can later be confirmed by visualization
data.describe()

In [None]:
# Statistical summary of the categorical columns
data.describe(include = 'object')
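
As a minimal sketch of the difference between the two calls, the snippet below uses a small made-up frame (the `toy` dataframe and its values are assumptions for illustration, not the loan dataset). The numeric summary reports statistics such as the mean, while the object summary reports the count, number of unique values, most frequent value and its frequency.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "income": [5849, 4583, 3000, 2583, 6000],
    # explicit object dtype so that include="object" matches across pandas versions
    "education": pd.Series(
        ["Graduate", "Graduate", "Graduate", "Not Graduate", "Graduate"],
        dtype="object",
    ),
})

numeric_summary = toy.describe()                 # numeric columns only by default
object_summary = toy.describe(include="object")  # count, unique, top, freq

print(numeric_summary)
print(object_summary)
```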

## SORT FUNCTION

This is used to arrange the rows of a dataframe by the values of one or more columns, in either ascending or descending order.

In [None]:
# This sorts the dataframe by Education in alphabetical order, i.e. all graduates come before the non-graduates.
data.sort_values("Education")
# data.sort_values('ApplicantIncome', ascending = False)
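
Sorting can also use more than one column. The sketch below, on a made-up `toy` frame (an assumption for illustration), sorts by `Education` first and then by `ApplicantIncome` from highest to lowest within each education group. Note that `sort_values` returns a new dataframe and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "Education": ["Not Graduate", "Graduate", "Graduate", "Not Graduate"],
    "ApplicantIncome": [2583, 4583, 6000, 3000],
})

# Sort by Education (A to Z), then by income (high to low) within each group
ordered = toy.sort_values(
    ["Education", "ApplicantIncome"], ascending=[True, False]
)

print(ordered)
```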

## MAKING CHANGES TO THE DATAFRAME

### Adding a new column to the dataframe
Suppose we want the difference between the income of the applicant and that of the coapplicant, stored in a new column.

In [None]:
# Create a new column
data['IncomeDifference'] = data['ApplicantIncome'] - data['CoapplicantIncome']

In [None]:
data.head()

### Dropping an existing column from the dataframe

In [None]:
data = data.drop(columns = 'IncomeDifference')
# data = data.drop('IncomeDifference', axis = 1)  This is another method of dropping an existing column
data.head(2)

## FILTERING DATA

Filtering selects only the rows (or columns) of the dataframe that meet a given condition, which reduces the dataframe to the part we are interested in.

In [None]:
data.loc[(data['Education'] == 'Graduate') & (data['Dependents'] == '0')]

In [None]:
# To narrow down a dataframe to rows whose text contains a given word

df.loc[df['Item_Type'].str.contains('Meat')]
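
Two other common filtering patterns are matching against a list of exact values and combining several conditions. The sketch below uses a made-up `toy` frame (its values are assumptions for illustration); `isin` and `between` are standard pandas Series methods.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Meat", "Soft Drinks", "Household", "Meat"],
    "Item_MRP": [249.8, 141.6, 48.3, 53.9, 90.0],
})

# Keep rows whose Item_Type is one of several exact values
food = toy.loc[toy["Item_Type"].isin(["Dairy", "Meat"])]

# Combine conditions: wrap each one in parentheses and join with & (and) or | (or)
pricey_meat = toy.loc[
    (toy["Item_Type"] == "Meat") & (toy["Item_MRP"].between(100, 200))
]

print(food)
print(pricey_meat)
```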

## AGGREGATE STATISTICS (GROUPBY)

Grouping lets pandas compute summary statistics (such as the mean) separately for each group of rows in the dataframe.

In [None]:
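# numeric_only=True restricts the mean to numeric columns; recent pandas versions raise an error on text columns otherwise
df.groupby(['Item_Fat_Content']).mean(numeric_only=True)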

In [None]:
df.groupby(['Item_Fat_Content']).mean(numeric_only=True).sort_values('Outlet_Establishment_Year')

In [None]:
df.groupby(['Item_Type']).mean(numeric_only=True).sort_values('Item_Outlet_Sales', ascending=False)
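
When more than one statistic is needed per group, `agg` can compute them in a single pass. This is a minimal sketch on a made-up `toy` frame (values are assumptions for illustration), selecting one numeric column before aggregating.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular"],
    "Item_Outlet_Sales": [100.0, 200.0, 300.0, 400.0],
})

# Mean, maximum and row count of sales per fat-content group, highest mean first
summary = (
    toy.groupby("Item_Fat_Content")["Item_Outlet_Sales"]
    .agg(["mean", "max", "count"])
    .sort_values("mean", ascending=False)
)

print(summary)
```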

## CHECKING FOR DUPLICATE ROWS/COLUMNS

In [None]:
duplicated = df.duplicated().sum()  # number of rows that repeat an earlier row
if duplicated:
    print('Duplicate rows in Dataset are {}'.format(duplicated))
else:
    print('Dataset contains no duplicate values')

In [None]:
# To drop duplicates, the below function is used:

# df = df.drop_duplicates()
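
By default, `duplicated` and `drop_duplicates` compare entire rows. The sketch below, on a made-up `toy` frame (values are assumptions for illustration), shows how the `subset` and `keep` parameters change what counts as a duplicate and which copy is kept.

```python
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "Item_Identifier": ["FDA15", "FDA15", "DRC01", "DRC01"],
    "Item_MRP": [249.8, 249.8, 48.3, 50.0],
})

n_exact = toy.duplicated().sum()                          # identical in every column
n_by_id = toy.duplicated(subset="Item_Identifier").sum()  # same identifier repeated

deduped = toy.drop_duplicates()  # keeps the first copy of each exact duplicate
deduped_by_id = toy.drop_duplicates(subset="Item_Identifier", keep="last")

print(n_exact, n_by_id)
print(deduped_by_id)
```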

## HANDLING MISSING DATA

For missing values, there are 3 ways to deal with them:
1. Leave it as it is
2. Drop them
3. Fill the missing values 

In [None]:
# Checking for missing values
df.isnull().sum()

In [None]:
# This returns False if at least one column has a missing value, and True otherwise
df.notnull().all().all()
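
It often helps to see the share of missing values per column before choosing between dropping and filling. A minimal sketch on a made-up `toy` frame (values are assumptions for illustration): since `isnull()` yields booleans, taking the column mean gives the fraction of missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, np.nan],
    "Outlet_Size": ["Medium", None, "High", "Medium"],
    "Item_MRP": [249.8, 48.3, 141.6, 182.1],
})

missing_pct = toy.isnull().mean() * 100  # percentage of missing values per column
has_missing = toy.isnull().any().any()   # True if any cell in the frame is missing

print(missing_pct)
print(has_missing)
```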

### Filling a missing value

Missing values, usually represented by NaN, can be dropped (along with the rows or the column that contain them) if they are not significant to the analysis.
If they are significant, or if only a few values are missing, they should be filled instead. The appropriate method of filling depends on the type of the column that contains the missing values.

- Categorical column: fill with the mode (i.e. the value that appears most often)
- Numerical column: fill with the mean or the median
  - if the distribution is skewed, fill with the median
  - if it is not skewed, fill with either the mean or the median
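
The rules above can be sketched as a small loop. Everything specific here is an assumption for illustration: the `toy` frame, its values, and the `SKEW_THRESHOLD` cutoff of 1.0 (there is no single agreed threshold for calling a column skewed, so treat it as a tunable choice).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame for illustration only
toy = pd.DataFrame({
    "Item_Weight": [9.0, 10.0, 11.0, np.nan, 10.0],              # roughly symmetric
    "Item_Outlet_Sales": [100.0, 120.0, 110.0, np.nan, 5000.0],  # strongly right-skewed
    "Outlet_Size": ["Medium", "Medium", None, "High", "Medium"], # categorical
})

SKEW_THRESHOLD = 1.0  # assumed cutoff, not a fixed rule

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Skewed numeric column: median; otherwise: mean
        skewed = abs(filled[col].skew()) > SKEW_THRESHOLD
        fill_value = filled[col].median() if skewed else filled[col].mean()
    else:
        # Categorical column: most frequent value
        fill_value = filled[col].mode()[0]
    filled[col] = filled[col].fillna(fill_value)

print(filled)
```

The median is preferred for the skewed column because a single extreme value (5000.0 here) pulls the mean far away from the typical values.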

In [None]:
df.info()

In [None]:
# Dropping a particular column

new_df = df.drop('Outlet_Size', axis=1)

new_df.info()

In [None]:
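# Filling the remaining missing values
# Each numeric column is filled with its own mean; numeric_only=True skips the text columns
# The info() output below should now show no missing values

new_df2 = new_df.fillna(new_df.mean(numeric_only=True))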

new_df2.info()

In [None]:
new_df2.notnull().all().all()