### Data gathering

In [11]:
import pandas as pd

df = pd.read_csv("results.csv")
df.head(5)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [12]:
df.shape

(45315, 9)

### Data Assessment

Let's proceed to assess the quality and tidiness of our data.

- Checking for erroneous data types

In [None]:
df.dtypes

- The date column has the object data type. We could convert it to a datetime data type, which makes it easier to extract the month, year, or day during further analysis.
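As a minimal sketch of what the conversion enables, the snippet below builds a small toy frame (the name `toy` and its two rows are illustrative, not the full dataset) and pulls date parts out through the `.dt` accessor, which pandas only exposes on datetime-like columns.

```python
import pandas as pd

# Toy frame (hypothetical name) with two date strings in the dataset's format.
toy = pd.DataFrame({"date": ["1872-11-30", "2001-04-11"]})

# While the column holds plain strings (object dtype), .dt is not available.
toy["date"] = pd.to_datetime(toy["date"])

# After conversion, date parts can be extracted column-wise.
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["weekday"] = toy["date"].dt.day_name()

print(toy)
```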

In [14]:
df.describe()

Unnamed: 0,home_score,away_score
count,45315.0,45315.0
mean,1.739314,1.178241
std,1.746904,1.392095
min,0.0,0.0
25%,1.0,0.0
50%,1.0,1.0
75%,2.0,2.0
max,31.0,21.0


Here we observe maximums of 31 home goals and 21 away goals. These values are unusually high, so we inspect the matches to verify that the data is authentic.

In [15]:
df.query('home_score==31')

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
24160,2001-04-11,Australia,American Samoa,31,0,FIFA World Cup qualification,Coffs Harbour,Australia,False


In [16]:
df.query('away_score==21')

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
27753,2005-03-11,Guam,North Korea,0,21,EAFF Championship,Taipei,Taiwan,True


- We checked whether these goal counts were accurate and found both results to be correct.
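One possible way to surface such extreme scorelines in a single step, rather than querying each maximum separately, is to rank matches by total goals. The sketch below uses a toy frame with three rows copied from the outputs above; the `total_goals` column and the `toy` and `top` names are assumptions introduced for illustration.

```python
import pandas as pd

# Three rows taken from the outputs shown above.
toy = pd.DataFrame({
    "home_team": ["Australia", "England", "Guam"],
    "away_team": ["American Samoa", "Scotland", "North Korea"],
    "home_score": [31, 4, 0],
    "away_score": [0, 2, 21],
})

# Hypothetical helper column: goals scored by both sides.
toy["total_goals"] = toy["home_score"] + toy["away_score"]

# nlargest returns the rows with the highest values, sorted in descending order.
top = toy.nlargest(2, "total_goals")
print(top)
```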

In [18]:
df.isnull().sum()

date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
dtype: int64

- There are no null values. Let's proceed to check for duplicates.
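When counting missing values, it helps to keep in mind that calling `.count()` on the result of `isnull()` reports the number of rows in every column, whether or not anything is missing, while `.sum()` reports the number of missing cells. The sketch below illustrates the difference on a toy frame; the data and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with one missing value in each column.
toy = pd.DataFrame({
    "home_score": [0, np.nan, 2],
    "city": ["Glasgow", None, "London"],
})

# isnull() yields a boolean frame with no gaps, so count() is just the row count.
rows_per_column = toy.isnull().count()

# Summing the booleans counts the True entries, i.e. the missing cells.
missing_per_column = toy.isnull().sum()

print(rows_per_column)
print(missing_per_column)
```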

In [24]:
df.duplicated().sum()

0

- There are no duplicate rows, so nothing needs cleaning on that front.
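`duplicated()` with no arguments only flags rows that match in every column. A stricter check, sketched below, is to look for repeats on the columns that identify a match. The choice of `date`, `home_team`, and `away_team` as the key is an assumption, and the toy rows (including the variant spelling of the city) are invented to show the difference.

```python
import pandas as pd

# Toy frame: the first two rows describe the same match but spell the city differently.
toy = pd.DataFrame({
    "date": ["2001-04-11", "2001-04-11", "2005-03-11"],
    "home_team": ["Australia", "Australia", "Guam"],
    "away_team": ["American Samoa", "American Samoa", "North Korea"],
    "city": ["Coffs Harbour", "Coffs Harbor", "Taipei"],
})

# Exact duplicates across all columns: the spelling difference hides the repeat.
full_dupes = toy.duplicated().sum()

# Duplicates on the assumed match key only.
key_dupes = toy.duplicated(subset=["date", "home_team", "away_team"]).sum()

print(full_dupes, key_dupes)
```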

In [25]:
df.sample(45)

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
30217,2007-12-12,Tanzania,Somalia,1,0,CECAFA Cup,Dar es Salaam,Tanzania,False
33558,2011-07-03,Laos,Cambodia,6,2,FIFA World Cup qualification,Vientiane,Laos,False
6608,1967-08-02,Guyana,Suriname,0,2,Friendly,Georgetown,Guyana,False
4270,1957-10-02,Denmark,Republic of Ireland,0,2,FIFA World Cup qualification,Copenhagen,Denmark,False
45178,2023-11-16,Estonia,Austria,0,2,UEFA Euro qualification,Tallinn,Estonia,False
23029,2000-03-19,Dominica,Haiti,1,3,FIFA World Cup qualification,Roseau,Dominica,False
11240,1979-09-04,Fiji,Solomon Islands,2,0,South Pacific Games,Suva,Fiji,False
21247,1997-10-11,Sweden,Estonia,1,0,FIFA World Cup qualification,Solna,Sweden,False
2640,1946-10-08,Bulgaria,Romania,2,2,Balkan Cup,Tirana,Albania,True
5196,1962-09-05,Netherlands,Curaçao,8,0,Friendly,Amsterdam,Netherlands,False


### Data Cleaning 

#### Define

1 - Convert the date column from object to a datetime data type


#### Code

In [35]:
df['date'] = pd.to_datetime(df['date'])

#### Test

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45315 entries, 0 to 45314
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        45315 non-null  datetime64[ns]
 1   home_team   45315 non-null  object        
 2   away_team   45315 non-null  object        
 3   home_score  45315 non-null  int64         
 4   away_score  45315 non-null  int64         
 5   tournament  45315 non-null  object        
 6   city        45315 non-null  object        
 7   country     45315 non-null  object        
 8   neutral     45315 non-null  bool          
dtypes: bool(1), datetime64[ns](1), int64(2), object(5)
memory usage: 2.8+ MB


We can see that the date column's data type is now datetime64[ns], so the cleaning step was successful.
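The test step could also be expressed as an explicit check instead of reading the `info()` output by eye. The sketch below uses `is_datetime64_any_dtype` from `pandas.api.types` on a toy frame; the `toy`, `before`, and `after` names are illustrative.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Toy frame (hypothetical) with dates stored as strings, as in the raw CSV.
toy = pd.DataFrame({"date": ["1872-11-30", "1873-03-08"]})

# Record the state before and after the conversion.
before = is_datetime64_any_dtype(toy["date"])
toy["date"] = pd.to_datetime(toy["date"])
after = is_datetime64_any_dtype(toy["date"])

# Fail loudly if the cleaning step did not take effect.
assert after, "date column was not converted to datetime"
print(before, after)
```

On the real data, the same assertion could be applied to `df['date']` right after the conversion cell.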

### Data Visualisation