# TASKS: Data Cleaning

1. Convert the date column to datetime datatype.
2. Standardize the column names.
3. Handle missing values appropriately, if they exist. Justify your chosen method for
handling the missing values.
4. Are there duplicates in the dataset? If yes, record the number of duplicates and handle
appropriately.
5. Handle outliers in the dataset, if they exist. Justify your method of handling the
outliers.


In [2]:
# import libraries
import pandas as pd

In [3]:
#install pyarrow
!pip install pyarrow



In [4]:
# loading my dataset

cup_df = pd.read_csv('world_cup_matches.csv')

In [5]:
# Make a copy of it for reference, so the raw data stays untouched
df = cup_df.copy()

# 1. Convert the date column to datetime datatype.

In [6]:
# Convert the Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   ID              900 non-null    int64         
 1   Year            900 non-null    int64         
 2   Date            900 non-null    datetime64[ns]
 3   Stage           900 non-null    object        
 4   Home Team       900 non-null    object        
 5   Home Goals      900 non-null    int64         
 6   Away Goals      900 non-null    int64         
 7   Away Team       900 non-null    object        
 8   Win Conditions  62 non-null     object        
 9   Host Team       900 non-null    bool          
dtypes: bool(1), datetime64[ns](1), int64(4), object(4)
memory usage: 64.3+ KB
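
The conversion above succeeds because every date in this file parses cleanly. As a hedged sketch of a more defensive variant, the example below uses a small made-up sample (two dates from this dataset plus one deliberately bad entry) and assumes the raw strings follow the `YYYY-MM-DD` pattern. The `errors="coerce"` option turns unparseable entries into `NaT` so they can be counted instead of raising an error.

```python
import pandas as pd

# Hypothetical sample: two valid dates from the dataset plus one bad entry
sample = pd.DataFrame({"Date": ["1930-07-13", "2018-07-14", "not a date"]})

# Assumed format: YYYY-MM-DD. errors="coerce" converts failures to NaT
parsed = pd.to_datetime(sample["Date"], format="%Y-%m-%d", errors="coerce")

# Count the entries that could not be parsed so they can be inspected
n_bad = int(parsed.isna().sum())
print(parsed)
print("Unparseable dates:", n_bad)
```

If `n_bad` is greater than zero on real data, the failing rows are worth inspecting before any further cleaning.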


# 2. Standardize the column names.

In [9]:
# Change column names to lowercase

df.columns = df.columns.str.lower()

In [10]:
df.columns

Index(['id', 'year', 'date', 'stage', 'home team', 'home goals', 'away goals',
       'away team', 'win conditions', 'host team'],
      dtype='object')

In [11]:
# Replace spaces in column names with underscores
df.columns = df.columns.str.replace(' ', '_')

In [12]:
df.columns

Index(['id', 'year', 'date', 'stage', 'home_team', 'home_goals', 'away_goals',
       'away_team', 'win_conditions', 'host_team'],
      dtype='object')
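
The two steps above can also be chained into one expression. The sketch below is one possible way to do it, not the only one. It uses three column names from this dataset with stray spaces added purely for illustration, and it also strips leading and trailing whitespace, which the original columns did not need.

```python
import pandas as pd

# Column names from the dataset, with stray spaces added for illustration
sample = pd.DataFrame(columns=[" Home Team", "Away Goals ", "Win Conditions"])

# Strip outer whitespace, lowercase, then turn inner whitespace into underscores
sample.columns = (
    sample.columns
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", "_", regex=True)
)
print(list(sample.columns))
```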

In [13]:
df

,id,year,date,stage,home_team,home_goals,away_goals,away_team,win_conditions,host_team
0,1,1930,1930-07-13,Group stage,France,4,1,Mexico,,False
1,2,1930,1930-07-13,Group stage,United States,3,0,Belgium,,False
2,3,1930,1930-07-14,Group stage,Yugoslavia,2,1,Brazil,,False
3,4,1930,1930-07-14,Group stage,Romania,3,1,Peru,,False
4,5,1930,1930-07-15,Group stage,Argentina,1,0,France,,False
...,...,...,...,...,...,...,...,...,...,...
895,896,2018,2018-07-07,Quarter-finals,Russia,2,2,Croatia,Croatia win on penalties (3 - 4),True
896,897,2018,2018-07-10,Semi-finals,France,1,0,Belgium,,False
897,898,2018,2018-07-11,Semi-finals,Croatia,2,1,England,Extra time,False
898,899,2018,2018-07-14,Third place,Belgium,2,0,England,,False


# 3. Handle missing values appropriately, if they exist. Justify your chosen method for handling the missing values.

In [14]:
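# Fill missing win_conditions with "Normal Time": only 62 of the 900 matches record a value
# (e.g. extra time or penalties), so a blank is assumed to mean the match ended in normal time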
df['win_conditions'] = df['win_conditions'].fillna("Normal Time")

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   id              900 non-null    int64         
 1   year            900 non-null    int64         
 2   date            900 non-null    datetime64[ns]
 3   stage           900 non-null    object        
 4   home_team       900 non-null    object        
 5   home_goals      900 non-null    int64         
 6   away_goals      900 non-null    int64         
 7   away_team       900 non-null    object        
 8   win_conditions  900 non-null    object        
 9   host_team       900 non-null    bool          
dtypes: bool(1), datetime64[ns](1), int64(4), object(4)
memory usage: 64.3+ KB


NB: Don't handle missing values on your own. Work with the business team, which has the domain knowledge. Collaboration is key!
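
Before choosing a fill value, it helps to measure how much is missing in each column. The sketch below shows one way to do that on a three-row sample built from rows in this dataset. The variable names are illustrative, and the "Normal Time" label rests on the same assumption as above.

```python
import pandas as pd

# Sample rows from the dataset; None marks a missing win condition
sample = pd.DataFrame({
    "home_team": ["France", "Russia", "Croatia"],
    "win_conditions": [None, "Croatia win on penalties (3 - 4)", "Extra time"],
})

# Number of missing values per column
missing_per_column = sample.isna().sum()

# Share of rows with a missing win condition
missing_share = sample["win_conditions"].isna().mean()

# Fill only after deciding what a blank means (assumed here: normal time)
filled = sample["win_conditions"].fillna("Normal Time")

print(missing_per_column)
print(f"Share missing: {missing_share:.0%}")
```

A column that is mostly empty, like `win_conditions` here, usually signals "not applicable" rather than lost data, which is why a descriptive label is preferred over dropping rows.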

# 4. Are there duplicates in the dataset? If yes, record the number of duplicates and handle appropriately.

In [17]:
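# Check for duplicate rows; duplicated() marks each row that repeats an earlier one
# No duplicates in our dataset (df.duplicated().sum() gives the exact count)
df.duplicated()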

0      False
1      False
2      False
3      False
4      False
       ...  
895    False
896    False
897    False
898    False
899    False
Length: 900, dtype: bool
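
The output above is truncated, so it cannot show the number of duplicates on its own. The sketch below counts them explicitly on a made-up sample whose third row deliberately repeats the first match. It also assumes that `id` is only a running counter. If that holds, a full-row check can never flag anything, so the check is repeated with `id` left out.

```python
import pandas as pd

# Hypothetical sample: row 3 repeats the match in row 1 under a new id
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "home_team": ["France", "United States", "France"],
    "away_team": ["Mexico", "Belgium", "Mexico"],
    "home_goals": [4, 3, 4],
    "away_goals": [1, 0, 1],
})

# Full-row check: the unique id hides the repeated match
n_full_row = int(sample.duplicated().sum())

# Check on every column except id
match_cols = [c for c in sample.columns if c != "id"]
n_duplicates = int(sample.duplicated(subset=match_cols).sum())

# Keep the first occurrence of each match and drop the rest
deduped = sample.drop_duplicates(subset=match_cols).reset_index(drop=True)

print("Full-row duplicates:", n_full_row)
print("Duplicates ignoring id:", n_duplicates)
```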

# 5. Handle outliers in the dataset, if they exist. Justify your method of handling the outliers.

In [18]:
## Handling outliers
df.describe()

,id,year,date,home_goals,away_goals
count,900.0,900.0,900,900.0,900.0
mean,450.5,1986.915556,1987-05-20 16:24:00,1.792222,1.038889
min,1.0,1930.0,1930-07-13 00:00:00,0.0,0.0
25%,225.75,1970.0,1970-06-14 00:00:00,1.0,0.0
50%,450.5,1990.0,1990-06-23 12:00:00,2.0,1.0
75%,675.25,2006.0,2006-06-19 00:00:00,3.0,2.0
max,900.0,2018.0,2018-07-15 00:00:00,10.0,7.0
std,259.951919,23.15027,,1.593279,1.081567


In [None]:
## NB: To handle outliers in a dataset, collaborate with the business to find out what actually happened,
## i.e. whether a particular outlier really occurred or is a data error

In [19]:
# We start by subsetting the dataframe to matches with more than 6 home goals
condition = df['home_goals'] > 6
df[condition]

,id,year,date,stage,home_team,home_goals,away_goals,away_team,win_conditions,host_team
24,25,1934,1934-05-27,Round of 16,Italy,7,1,United States,Normal Time,True
46,47,1938,1938-06-12,Quarter-finals,Sweden,8,0,Cuba,Normal Time,False
66,67,1950,1950-07-02,First round,Uruguay,8,0,Bolivia,Normal Time,False
70,71,1950,1950-07-09,Final round,Brazil,7,1,Sweden,Normal Time,True
81,82,1954,1954-06-17,Group stage,Hungary,9,0,South Korea,Normal Time,False
83,84,1954,1954-06-19,Group stage,Uruguay,7,0,Scotland,Normal Time,False
87,88,1954,1954-06-20,Group stage,Hungary,8,3,Germany,Normal Time,False
88,89,1954,1954-06-20,Group stage,Turkey,7,0,South Korea,Normal Time,False
91,92,1954,1954-06-23,Group stage,Germany,7,2,Turkey,Normal Time,False
105,106,1958,1958-06-08,Group stage,France,7,3,Paraguay,Normal Time,False


In [None]:
# For example, the Hungary vs El Salvador result really happened
## If I want to report the average goals in a tournament, I have to remove the outliers:
## such scores are very rare, so they will heavily impact my average
## We would need to remove these rows, as they will distort the average reported to management
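
To illustrate how much a few high scores can move the average, the sketch below compares the mean with and without them on seven `home_goals` values taken from rows of this dataset. The cutoff of 6 mirrors the filter used above. The median is included as a possible alternative that is less sensitive to extreme values, which is a suggestion rather than part of the original analysis.

```python
import pandas as pd

# home_goals from seven matches in the dataset: five ordinary scores, two very high
home_goals = pd.Series([4, 3, 2, 3, 1, 9, 8], name="home_goals")

# Mean over all matches, pulled upward by the two high scores
mean_all = home_goals.mean()

# Mean after dropping matches with more than 6 home goals
mean_trimmed = home_goals[home_goals <= 6].mean()

# The median barely reacts to the extreme values
median_all = home_goals.median()

print(f"Mean (all rows):        {mean_all:.2f}")
print(f"Mean (home_goals <= 6): {mean_trimmed:.2f}")
print(f"Median (all rows):      {median_all:.1f}")
```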

One approach is to remove outliers from the dataset. This can be done by filtering out data points that fall outside a certain range based on Z-score or IQR.
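
A minimal sketch of the IQR approach follows, using eleven `home_goals` values taken from rows of this dataset. The multiplier 1.5 is the conventional choice, not something the data dictates, and the variable names are illustrative.

```python
import pandas as pd

# home_goals from eleven matches in the dataset, including two very high scores
goals = pd.Series([4, 3, 2, 3, 1, 2, 1, 2, 2, 9, 8], name="home_goals")

# Quartiles and interquartile range
q1 = goals.quantile(0.25)
q3 = goals.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences, then keep the rest
is_outlier = (goals < lower) | (goals > upper)
without_outliers = goals[~is_outlier]

print(f"Fences: [{lower}, {upper}]")
print("Flagged:", goals[is_outlier].tolist())
```

On the full dataset, the `df.describe()` output gives quartiles of 1 and 3 for `home_goals`, so the upper fence is 3 + 1.5 × 2 = 6, which matches the `home_goals > 6` condition.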

