<a href="https://colab.research.google.com/github/BenHigginsData/pythonguides/blob/main/A_guide_to_Pandas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <u>**A guide to Pandas**</u>

The purpose of this notebook is to act as a reference for all your Pandas data transformation needs.

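The dataset used for this notebook can be found here: [Football Results 1872-2021.csv](https://raw.githubusercontent.com/BenHigginsData/pythonguides/main/Football%20Results%201872-2021.csv)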


# Prerequisites

Please run the cell directly below before continuing. The cell will load all necessary dependencies.

In [1]:
import pandas as pd
import numpy as np

# Pandas documentation: https://pandas.pydata.org/pandas-docs/stable/index.html

# Loading the data

In [2]:
# To load data hosted on GitHub, open the file, click 'Raw', and copy the URL of the page that loads.
football_dataset_gh = 'https://raw.githubusercontent.com/BenHigginsData/pythonguides/main/Football%20Results%201872-2021.csv'

df = pd.read_csv(football_dataset_gh)

Other arguments to the `pd.read_*` functions (for example `pd.read_csv` and `pd.read_excel`) that may be useful, listed as ***argument=*** description:

***header=*** Row number(s), counted from 0, to use as the column names; the data starts on the row after. <br>
***usecols=*** Return a subset of the columns. Specify with a list of column numbers or column names; `pd.read_excel` also accepts Excel column letters such as `"A:C"`. <br>
***index_col=*** Specify which column to use as the index.
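
As a rough illustration of these three arguments, the sketch below reads a tiny in-memory CSV instead of the real file. The two data rows are taken from the dataset preview; the junk title line above the header is invented purely to show what `header=` does.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file. The first line is an invented title line
# that sits above the real header row.
csv_text = """Football results export
date,home_team,away_team,home_score,away_score
1872-11-30,Scotland,England,0,0
1873-03-08,England,Scotland,4,2
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    header=1,                                     # column names are on row 1 (counting from 0)
    usecols=["date", "home_team", "home_score"],  # read only these columns
    index_col="date",                             # use the date column as the row index
)

print(sample_df)
```

Note that a column named in `index_col` must also be included in `usecols`, otherwise pandas cannot find it.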


# Basic dataset information

In [16]:
# View the first few rows of the dataset.
# By default the first 5 rows are shown; pass an integer as the argument to change the number of rows shown.
# Passing a negative number -n as the argument will show all rows except the last n.

df.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [None]:
# Information on each of the columns, such as the data type and number of missing values.

df.info()

In [None]:
# The number of rows and columns in the dataset

df.shape

In [None]:
# df.values returns a NumPy array of all the values, with one array row per DataFrame row.
# (The pandas documentation recommends df.to_numpy() for the same result.)

df.values

In [None]:
# Column names

df.columns

In [None]:
# Summary statistics

df.describe()

In [None]:
# Summary of missing values per column

df.isna().sum()
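
Once the missing values are counted, the usual next step is to drop or fill them. The sketch below is one possible way to do that on a small made-up frame; the missing entries are invented for illustration, since the football data itself reports no missing values.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing city and one missing score.
matches = pd.DataFrame({
    "city": ["Glasgow", None, "Glasgow"],
    "home_score": [0.0, 4.0, np.nan],
})

missing_per_column = matches.isna().sum()    # count of missing values per column
missing_share = matches.isna().mean()        # fraction of missing values per column

dropped = matches.dropna(subset=["city"])    # drop rows where 'city' is missing
filled = matches.fillna({"home_score": 0})   # fill missing scores with 0, leave 'city' alone

print(missing_per_column)
```

Whether dropping or filling is appropriate depends on the analysis; filling a score with 0 is only a placeholder choice here.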

In [None]:
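# Unique values (wrapped in display() because only the last expression in a cell is shown automatically)
display(df['city'].unique())

# The length of that array is the count of unique values
# (df['city'].nunique() is a shortcut that skips missing values)
display(len(df['city'].unique()))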

In [None]:
# Count of values per unique category entry
df['city'].value_counts()
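
A small sketch of the counting helpers, using the five city values from the dataset preview rather than the full column:

```python
import pandas as pd

# The 'city' values of the first five rows of the dataset.
cities = pd.Series(["Glasgow", "London", "Glasgow", "London", "Glasgow"], name="city")

n_unique = cities.nunique()                    # number of distinct values (missing values are ignored)
counts = cities.value_counts()                 # occurrences of each value, most frequent first
shares = cities.value_counts(normalize=True)   # the same counts as a fraction of all rows

print(counts)
```

`normalize=True` is handy when proportions matter more than raw counts.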

# Transformation

## Cleaning

In [None]:
# Remove duplicate rows

df_no_dups = df.drop_duplicates()
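
`drop_duplicates()` also accepts `subset=` and `keep=` arguments, which control which columns define a duplicate and which copy survives. The following is a minimal sketch on made-up rows (the repeated first match is invented to have something to remove):

```python
import pandas as pd

# Sample rows; the second row is an invented exact copy of the first.
matches = pd.DataFrame({
    "date": ["1872-11-30", "1872-11-30", "1873-03-08"],
    "home_team": ["Scotland", "Scotland", "England"],
    "away_team": ["England", "England", "Scotland"],
})

dup_count = matches.duplicated().sum()   # how many rows repeat an earlier row

exact = matches.drop_duplicates()        # compares every column, keeps the first copy

# Only 'home_team' decides what counts as a duplicate; keep the last occurrence.
by_team = matches.drop_duplicates(subset=["home_team"], keep="last")

print(exact)
```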

In [None]:
# Remove certain characters from certain columns


# List of characters to remove
chars_to_remove = ["+", ",", "$"]

# List of column names to clean
cols_to_clean = ["tournament", "city"]

# Loop for replacing the characters with an empty string in the specified columns:
# for each column --> for each character --> replace the column's values with the character removed.
# .str.replace(..., regex=False) treats each character literally and leaves missing values untouched.
for col in cols_to_clean:
    for char in chars_to_remove:
        df[col] = df[col].str.replace(char, "", regex=False)
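
The inner loop over characters can also be collapsed into one pass per column with a regular-expression character class. This is only a sketch of that alternative, run on made-up sample values:

```python
import re

import pandas as pd

chars_to_remove = ["+", ",", "$"]

# Made-up values, including a missing entry, to show the behaviour.
values = pd.Series(["$1,000+", "Friendly", None])

# Build a character class such as [\+,\$]; re.escape keeps special characters literal.
pattern = "[" + re.escape("".join(chars_to_remove)) + "]"

# One vectorised pass removes every listed character; missing values stay missing.
cleaned = values.str.replace(pattern, "", regex=True)

print(cleaned)
```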
        

In [17]:
# Correcting data types with 'df.astype()'

df["home_score"] = df["home_score"].astype("int")
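
`astype("int")` fails as soon as a column contains something that is not a number. One common workaround, sketched here on an invented series of text scores, is `pd.to_numeric` with `errors="coerce"`, optionally followed by the nullable `Int64` dtype:

```python
import pandas as pd

# Invented raw values: scores stored as text, with one unusable entry.
raw_scores = pd.Series(["0", "4", "2", "n/a"])

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising an error.
scores = pd.to_numeric(raw_scores, errors="coerce")

# The nullable "Int64" dtype (capital I) stores integers alongside missing values.
scores_int = scores.astype("Int64")

print(scores_int)
```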

## Sorting

In [None]:
# Sort on a single column, ascending by default (pass ascending=False for descending order)
df.sort_values("country")

# Sort on multiple columns with multiple ascending/descending conditions
df.sort_values(["country", "city"], ascending=[False, True])

# Sort by the row index
df.sort_index()
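
To make the sort directions concrete, here is a small sketch on a made-up venue table (the city and country pairs are sample values, not rows from the dataset):

```python
import pandas as pd

# Made-up sample of venues.
venues = pd.DataFrame({
    "country": ["Scotland", "England", "Scotland", "England"],
    "city": ["Glasgow", "London", "Edinburgh", "Birmingham"],
})

asc = venues.sort_values("country")                    # ascending is the default
desc = venues.sort_values("country", ascending=False)  # explicit descending order

# Country descending, then city ascending within each country.
mixed = venues.sort_values(["country", "city"], ascending=[False, True])

# sort_index() puts the rows back in their original index order.
restored = mixed.sort_index()

print(mixed)
```

All of these return a new DataFrame; assign the result back (for example `venues = venues.sort_values(...)`) to keep it.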

In [None]:
# Organise data to tidy format with a day number
# Add a column for country
# Append the weather data together
# Join the weather data to the football data
# Subset the data to 'has weather data' for weather analysis.
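
The join and subset steps in the plan above could look roughly like the sketch below. The weather table is entirely hypothetical: its column names and temperature readings are invented, because the notebook does not yet contain any weather data.

```python
import pandas as pd

# Three matches taken from the dataset preview.
matches = pd.DataFrame({
    "date": pd.to_datetime(["1872-11-30", "1873-03-08", "1874-03-07"]),
    "city": ["Glasgow", "London", "Glasgow"],
    "home_team": ["Scotland", "England", "Scotland"],
})

# Hypothetical weather table: the column names and readings are invented.
weather = pd.DataFrame({
    "date": pd.to_datetime(["1872-11-30", "1873-03-08"]),
    "city": ["Glasgow", "London"],
    "temp_c": [4.0, 9.0],
})

# A left join keeps every match; indicator=True adds a '_merge' column
# recording whether a weather row was found for it.
joined = matches.merge(weather, on=["date", "city"], how="left", indicator=True)

# Subset to the matches that actually have weather data.
has_weather = joined[joined["_merge"] == "both"].drop(columns="_merge")

print(has_weather)
```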

In [None]:
# Can I get weather data --> date by city?

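# Get a summary of missing values (wrapped in display() because only the last expression in a cell is shown automatically):
display(df.isna().sum())

# Let's take a look at the data to check everything is as expected and readable.
display(df.head())
df.info()  # df.info() prints its report itself and returns None, so it does not need display()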

# We can see that the date is not currently of type datetime. Let's change that by replacing the column with a datetime version.
df['date'] = pd.to_datetime(df['date'])

# There is a date column here. Let's add a year, month, day number, and weekday column to this data.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()

# Let's check again to make sure the columns are as expected:
display(df.head())


date          0
home_team     0
away_team     0
home_score    0
away_score    0
tournament    0
city          0
country       0
neutral       0
year          0
month         0
day           0
weekday       0
dtype: int64

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,month,day,weekday
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872,11,30,<bound method PandasDelegate._add_delegate_acc...
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,1873,3,8,<bound method PandasDelegate._add_delegate_acc...
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,1874,3,7,<bound method PandasDelegate._add_delegate_acc...
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875,3,6,<bound method PandasDelegate._add_delegate_acc...
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,1876,3,4,<bound method PandasDelegate._add_delegate_acc...


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42483 entries, 0 to 42482
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date        42483 non-null  datetime64[ns]
 1   home_team   42483 non-null  object        
 2   away_team   42483 non-null  object        
 3   home_score  42483 non-null  int64         
 4   away_score  42483 non-null  int64         
 5   tournament  42483 non-null  object        
 6   city        42483 non-null  object        
 7   country     42483 non-null  object        
 8   neutral     42483 non-null  bool          
 9   year        42483 non-null  int64         
 10  month       42483 non-null  int64         
 11  day         42483 non-null  int64         
 12  weekday     42483 non-null  object        
dtypes: bool(1), datetime64[ns](1), int64(5), object(6)
memory usage: 3.9+ MB



Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,year,month,day,weekday
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False,1872,11,30,Saturday
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False,1873,3,8,Saturday
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False,1874,3,7,Saturday
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False,1875,3,6,Saturday
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False,1876,3,4,Saturday


In [None]:
# Clean the data: check for NAs and remove any odd values.

# df.isna().sum()                 <-- how many NAs per column?
# df = df[~df['date'].isna()]     <-- keep only the rows where 'date' is not NA
# df.isna().sum()                 <-- how many NAs now?

# display(df.isna().sum())
# display(df.shape)
df.describe()




In [None]:
# Joining the codes data to get the vehicle type.


In [None]:
# Describe the data and get more info on the dataset like how many rows etc.

In [None]:
# Identify the main categories and do a few group-bys to sum up the stats
# Do min and max of a certain column
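
One way this could be done is with `groupby` and named aggregations. The sketch uses the five rows from the dataset preview; the output column names (`games`, `goals_for`, and so on) are illustrative choices.

```python
import pandas as pd

# The first five rows of the dataset (scores only).
matches = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland", "England", "Scotland"],
    "home_score": [0, 4, 2, 2, 3],
    "away_score": [0, 2, 1, 2, 0],
})

# Named aggregation: new_column=(source_column, aggregation function)
summary = matches.groupby("home_team").agg(
    games=("home_score", "size"),
    goals_for=("home_score", "sum"),
    goals_against=("away_score", "sum"),
    best_score=("home_score", "max"),
)

# Min and max of a single column
lowest = matches["home_score"].min()
highest = matches["home_score"].max()

print(summary)
```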


In [None]:
# Go through the various ways to filter, from query to iloc
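
A compact sketch of the main filtering styles, again on the five preview rows. It is meant as a quick comparison rather than a complete tour:

```python
import pandas as pd

# The first five rows of the dataset.
matches = pd.DataFrame({
    "home_team": ["Scotland", "England", "Scotland", "England", "Scotland"],
    "away_team": ["England", "Scotland", "England", "Scotland", "England"],
    "home_score": [0, 4, 2, 2, 3],
    "away_score": [0, 2, 1, 2, 0],
    "city": ["Glasgow", "London", "Glasgow", "London", "Glasgow"],
})

# Boolean mask: rows where the home team won
home_wins = matches[matches["home_score"] > matches["away_score"]]

# Combine conditions with & (and) or | (or); each condition needs its own brackets
glasgow_wins = matches[
    (matches["city"] == "Glasgow") & (matches["home_score"] > matches["away_score"])
]

# The same filter written as a query string
same_with_query = matches.query("city == 'Glasgow' and home_score > away_score")

# isin() matches against a list of allowed values
london = matches[matches["city"].isin(["London"])]

# loc: filter rows by condition and pick columns by name
by_label = matches.loc[matches["home_score"] >= 3, ["home_team", "home_score"]]

# iloc: pick rows and columns purely by position (first 2 rows, first 3 columns)
by_position = matches.iloc[:2, :3]

print(glasgow_wins)
```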

In [None]:
# Talk about changing the dataset from wide to long
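
A minimal sketch of the wide-to-long reshape with `melt`, and the reverse with `pivot`. The new column names `side` and `goals` are illustrative choices.

```python
import pandas as pd

# Wide format: one row per match, one column per score.
wide = pd.DataFrame({
    "date": ["1872-11-30", "1873-03-08"],
    "home_score": [0, 4],
    "away_score": [0, 2],
})

# Long format: one row per (match, side) pair.
long = wide.melt(
    id_vars="date",                            # column(s) to keep as identifiers
    value_vars=["home_score", "away_score"],   # columns to stack into rows
    var_name="side",
    value_name="goals",
)

# pivot() reverses the reshape, giving one column per 'side' value again.
back = long.pivot(index="date", columns="side", values="goals")

print(long)
```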

In [None]:
# Using datetime functions with pandas
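
A few of the `.dt` accessors that tend to be useful, sketched on the first three match dates from the dataset:

```python
import pandas as pd

# The first three match dates from the dataset.
dates = pd.Series(pd.to_datetime(["1872-11-30", "1873-03-08", "1874-03-07"]))

# year, month and day are attributes; day_name() and month_name() are methods and
# need the parentheses (without them the column fills with '<bound method ...>').
weekday = dates.dt.day_name()
month_name = dates.dt.month_name()

# Days since the previous row (the first row has no previous date, so it is NaN)
gap_days = dates.diff().dt.days

# Floor each year to its decade
decade = dates.dt.year // 10 * 10

print(gap_days)
```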

In [None]:
# Binning data
# df['price_category'] = pd.cut(df.price, [-np.inf, 400, 1000, np.inf],
                              # labels=['low', 'medium', 'high'])
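
The commented-out `pd.cut` pattern above can be tried on the football data by binning the total goals per match. In this sketch the totals come from the five preview rows, while the bin edges and labels are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# Total goals (home + away) for the first five rows of the dataset.
total_goals = pd.Series([0, 6, 3, 4, 3])

# Bins are closed on the right by default: (-inf, 1], (1, 3], (3, inf)
goal_band = pd.cut(
    total_goals,
    bins=[-np.inf, 1, 3, np.inf],
    labels=["low", "medium", "high"],
)

print(goal_band)
```

Using `-np.inf` and `np.inf` as the outer edges guarantees that no value falls outside the bins.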

In [None]:
# Saving our main dataset now that it's clean. We can use this in the data viz guides.
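
Saving could be as simple as the sketch below. It writes to an in-memory buffer so that it runs anywhere; with a real file you would pass a path instead, for example `clean.to_csv("football_clean.csv", index=False)`, where the file name is a hypothetical choice.

```python
import io

import pandas as pd

# A small stand-in for the cleaned dataset.
clean = pd.DataFrame({
    "date": pd.to_datetime(["1872-11-30", "1873-03-08"]),
    "home_team": ["Scotland", "England"],
    "home_score": [0, 4],
})

buffer = io.StringIO()
clean.to_csv(buffer, index=False)   # index=False avoids writing an extra 'Unnamed: 0' column

# Reading it back: CSV stores dates as text, so ask pandas to parse them again.
buffer.seek(0)
reloaded = pd.read_csv(buffer, parse_dates=["date"])

print(reloaded.dtypes)
```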