<a href="https://colab.research.google.com/github/IONUOHA/Accessing_DBs_with_Python/blob/main/GROUP2_ASSIGN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**GROUP TWO MEMBERS**

1. Oluwatoyin **IBITOYE**
2. Onuoha Ikechukwu
3.
4.


**Project Description**

GoalBet is a sports data analytics company that analyzes historical sports
data. We build predictive models that forecast sports scores for insight
analysis and betting purposes. We retrieve data from a wide variety of
sources in a variety of formats.
Typically, we retrieve results for sports (football, tennis,
horse racing) from both public websites and commercial data
providers. The information is usually in a semi-structured format such
as CSV, JSON or Microsoft Excel. We then transform the data and
put it in a persistent store, such as a database. Later on, this data is
used by our data science team.

**Project Objective**

As our newly hired data engineer, you’re requested to:
• Build an end-to-end Extract, Transform & Load (ETL) pipeline that
pulls data from the website of one of our data providers -
https://www.football-data.co.uk/englandm.php
• Use your file explorer as the data lake for staging the extracted
raw & transformed data in CSV format.

• Use PostgreSQL for persisting the transformed data.
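
One way to read the staging requirement above is a small folder layout with one directory for raw extracts and one for transformed files. The sketch below is an assumption, not part of the assignment: the `raw` and `transformed` layer names and the `stage_path` helper are hypothetical, and a temporary directory stands in for the real data lake folder.

```python
import tempfile
from pathlib import Path


def stage_path(base_dir, layer, name):
    """Return the CSV path for a staging layer, creating the folder if needed.

    `layer` is a hypothetical label: "raw" or "transformed".
    """
    if layer not in ("raw", "transformed"):
        raise ValueError("layer must be 'raw' or 'transformed'")
    folder = Path(base_dir) / layer
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{name}.csv"


# Demonstrate with a throwaway directory standing in for the data lake
with tempfile.TemporaryDirectory() as tmp:
    raw_file = stage_path(tmp, "raw", "E0_1920")
    raw_name = raw_file.name
    raw_layer = raw_file.parent.name
    raw_folder_created = raw_file.parent.is_dir()

print(raw_layer, raw_name)
```

Keeping raw and transformed files apart makes it possible to rerun the transform step without downloading the data again.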

**Instructions**
- Your ETL logic should be in a separate module.
- Create a utility file for any helper functions, e.g. a database
connection function.

- Your ETL pipeline should be run from a separate main module.
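
As a rough illustration of the utility-file idea, the helper below builds a PostgreSQL connection URL from environment-style settings. The variable names (`PGHOST`, `PGPORT`, `PGUSER`, `PGPASSWORD`, `PGDATABASE`) follow libpq's conventions, but the function name `build_postgres_url`, the fallback values and the example credentials are assumptions. The resulting URL would then be handed to whichever database driver the project uses.

```python
import os
from urllib.parse import quote


def build_postgres_url(env=None):
    """Build a postgresql:// URL from environment-style settings.

    `env` defaults to os.environ; the fallback values are placeholders.
    """
    env = os.environ if env is None else env
    # Percent-encode credentials so characters like '@' do not break the URL
    user = quote(env.get("PGUSER", "postgres"), safe="")
    password = quote(env.get("PGPASSWORD", ""), safe="")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "postgres")
    # Only include the password separator when a password is set
    credentials = f"{user}:{password}" if password else user
    return f"postgresql://{credentials}@{host}:{port}/{database}"


# Made-up settings, passed as a dict instead of real environment variables
example_url = build_postgres_url(
    {"PGUSER": "etl_user", "PGPASSWORD": "p@ss word", "PGDATABASE": "goalbet"}
)
print(example_url)
```

Reading the settings from the environment keeps credentials out of the ETL and main modules.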

Required Data

For this project, you are required to extract only football data from
URLs in these formats:

• https://www.football-data.co.uk/mmz4281/1920/E0.csv
• https://www.football-data.co.uk/mmz4281/1920/E2.csv
• https://www.football-data.co.uk/mmz4281/0203/E1.csv
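
The three URLs share one pattern: what appears to be a season code (for example `1920`) followed by a division code (for example `E0`). A minimal sketch of generating them, assuming that pattern holds; `build_data_url` is a hypothetical helper name.

```python
BASE_URL = "https://www.football-data.co.uk/mmz4281"


def build_data_url(season, division):
    """Return the CSV URL for a season code such as '1920' and a division such as 'E0'."""
    if len(season) != 4 or not season.isdigit():
        raise ValueError("season must be four digits, e.g. '1920'")
    return f"{BASE_URL}/{season}/{division}.csv"


# The (season, division) pairs required by this project
required = [("1920", "E0"), ("1920", "E2"), ("0203", "E1")]
urls = [build_data_url(season, division) for season, division in required]
print(urls)
```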

You can find documentation describing the data here:
https://www.football-data.co.uk/notes.txt

**Required Data**

Data structure:
We are only interested in a subset of the information:
Div = League Division
Date = Match Date (dd/mm/yy)
Time = Time of match kick off
HomeTeam = Home Team
AwayTeam = Away Team
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals

**DATA EXTRACTION**

In [None]:
# Upgrade pandas so we can deal with the inconsistencies between the files at the different URLs
!pip install --upgrade pandas


Collecting pandas
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 89.9/89.9 kB 5.3 MB/s eta 0:00:00
Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.1/13.1 MB 67.4 MB/s eta 0:00:00
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.10.1 requires pandas<2.2.3dev0,>=2.0, but you have pandas 2.2.3 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.2.3 w

In [None]:
# Import pandas and check its version
import pandas as pd
print(pd.__version__)  # Should be 1.3.0 or higher, since on_bad_lines (used below) was added in 1.3.0


2.2.2


In [None]:
# Extract the datasets from the URLs provided on the website

def extract_football_data(urls):
    """
    Extracts data from multiple URLs into a list of pandas DataFrames.

    Args:
        urls: A list of URLs to extract data from.

    Returns:
        A list of pandas DataFrames.
    """
    dataframes = []  # Create an empty list to store DataFrames

    for url in urls:
        try:
            # Read the CSV file from the URL, skip bad lines if there are any
            df = pd.read_csv(url, on_bad_lines='skip')
            dataframes.append(df)  # Add the DataFrame to the list
            print("Successfully extracted data from " + url)
        except Exception as e:
            # Print the error if there's an issue reading the CSV file
            print("Error extracting data from " + url + ": " + str(e))

    return dataframes  # Return the list of DataFrames



In [None]:
# URLs of the required datasets
urls = [
    "https://www.football-data.co.uk/mmz4281/1920/E0.csv",
    "https://www.football-data.co.uk/mmz4281/1920/E2.csv",
    "https://www.football-data.co.uk/mmz4281/0203/E1.csv",
]



In [None]:
# Extract the data from the list of URLs
dataframes = extract_football_data(urls)


Successfully extracted data from https://www.football-data.co.uk/mmz4281/1920/E0.csv
Successfully extracted data from https://www.football-data.co.uk/mmz4281/1920/E2.csv
Successfully extracted data from https://www.football-data.co.uk/mmz4281/0203/E1.csv


In [None]:
# Concatenate/merge all the DataFrames into one
concatenated_df = pd.concat(dataframes, ignore_index=True)
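
The three files do not appear to share exactly the same columns, which is why the combined frame contains many `NaN` values: `pd.concat` keeps the union of all column names and fills the gaps. A toy illustration with made-up frames (the values are not taken from the real files):

```python
import pandas as pd

# Two made-up frames that share only some of their columns
newer = pd.DataFrame({"Div": ["E0"], "Time": ["20:00"], "FTHG": [4]})
older = pd.DataFrame({"Div": ["E1"], "FTHG": [2], "SOH": [1.8]})

# Columns missing from one frame are filled with NaN in the result
combined = pd.concat([newer, older], ignore_index=True)
print(combined)
```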


In [None]:
# Print the first few rows of the concatenated DataFrame
print(concatenated_df.head())

# Print the shape (number of rows and columns)
print("Shape of concatenated DataFrame:", concatenated_df.shape)

# Print the column names of the concatenated DataFrame
print("Columns in concatenated DataFrame:", concatenated_df.columns.tolist())


  Div        Date   Time        HomeTeam          AwayTeam  FTHG  FTAG FTR  \
0  E0  09/08/2019  20:00       Liverpool           Norwich     4     1   H   
1  E0  10/08/2019  12:30        West Ham          Man City     0     5   A   
2  E0  10/08/2019  15:00     Bournemouth  Sheffield United     1     1   D   
3  E0  10/08/2019  15:00         Burnley       Southampton     3     0   H   
4  E0  10/08/2019  15:00  Crystal Palace           Everton     0     0   D   

   HTHG  HTAG  ... LBD LBA  SOH  SOD  SOA  SBH  SBD  SBA  GB>2.5  GB<2.5  
0     4     0  ... NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN     NaN     NaN  
1     0     1  ... NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN     NaN     NaN  
2     0     0  ... NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN     NaN     NaN  
3     0     0  ... NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN     NaN     NaN  
4     0     0  ... NaN NaN  NaN  NaN  NaN  NaN  NaN  NaN     NaN     NaN  

[5 rows x 120 columns]
Shape of concatenated DataFrame: (1300, 120)
Columns in c

In [None]:
# Save the concatenated DataFrame to a CSV file
concatenated_df.to_csv('football_data_combined.csv', index=False)


concatenated_df


In [None]:
# Download the combined CSV file to the local machine (Colab only)
from google.colab import files

files.download('football_data_combined.csv')



**Something to note**

A copy of the CSV file should have been downloaded to your machine by now. If you do not want this, skip the code cell just above this markdown cell.
To check where the CSV file is stored in your notebook environment, either print the current working directory or open the Files panel on the left and look under `content`. Your CSV file should be there.
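
A quick way to confirm where the file was written, sketched with the standard library only. The file name matches the one saved earlier; whether the file is present depends on the earlier cells having been run.

```python
import os

csv_name = "football_data_combined.csv"
working_dir = os.getcwd()
csv_path = os.path.join(working_dir, csv_name)

# Show the working directory and whether the CSV is in it
print("Working directory:", working_dir)
print("CSV present:", os.path.exists(csv_path))
```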

In [None]:
df = pd.read_csv('/content/football_data_combined.csv')
print(df.tail())

     Div        Date   Time        HomeTeam          AwayTeam  FTHG  FTAG FTR  \
1295  E1    04/05/03    NaN  Sheffield Weds           Walsall     2     1   H   
1296  E1    04/05/03    NaN           Stoke           Reading     1     0   H   
1297  E1    04/05/03    NaN         Watford  Sheffield United     2     0   H   
1298  E1    04/05/03    NaN       Wimbledon           Burnley     2     1   H   
1299  E1    04/05/03    NaN          Wolves         Leicester     1     1   D  

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    2 non-null      datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 160.0 bytes


**Missing data**

Handling missing values in the Date column.
We use a forward fill: it takes the most recent non-missing value and uses it to fill any missing values that follow, until it reaches another valid value. This gives a close approximation of the missing date.

In [None]:

# Example DataFrame with missing dates
data = {'Date': ['2023-01-01', None, '2023-01-03', None, None, '2023-01-06']}
df = pd.DataFrame(data)

# Convert the 'Date' column to datetime; unparseable values become NaT
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')

# Forward fill: each NaT takes the most recent valid date above it.
# Assigning the result back avoids the warning that inplace=True can trigger on a column.
df['Date'] = df['Date'].ffill()

# Print the DataFrame after forward fill
print(df)


        Date
0 2023-01-01
1 2023-01-01
2 2023-01-03
3 2023-01-03
4 2023-01-03
5 2023-01-06
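
Before forward filling real match dates, the Date column has to parse cleanly. The outputs above show two layouts in the combined file, `09/08/2019` and `04/05/03`, so a single format string will not cover both. A hedged sketch that tries the four-digit-year layout first and falls back to the two-digit one; the sample values are copied from the outputs above, and the helper name `parse_match_dates` is an assumption.

```python
import pandas as pd


def parse_match_dates(dates):
    """Parse day-first dates written with either a 4-digit or a 2-digit year."""
    long_year = pd.to_datetime(dates, format="%d/%m/%Y", errors="coerce")
    short_year = pd.to_datetime(dates, format="%d/%m/%y", errors="coerce")
    # Prefer the 4-digit parse; use the 2-digit parse where the first one failed
    return long_year.fillna(short_year)


sample = pd.Series(["09/08/2019", "10/08/2019", "04/05/03"])
parsed = parse_match_dates(sample)
print(parsed)
```

Any value that matches neither layout stays as `NaT`, which is where the forward fill described above would come in.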


**Extracting the required data**

Data structure: We are only interested in a subset of the information:
Div = League Division
Date = Match Date (dd/mm/yy)
Time = Time of match kick off
HomeTeam = Home Team
AwayTeam = Away Team
FTHG = Full Time Home Team Goals
FTAG = Full Time Away Team Goals

In [None]:
# Define the columns to extract (all seven required fields, including the goal columns)
columns_to_extract = ['Div', 'Date', 'Time', 'HomeTeam', 'AwayTeam', 'FTHG', 'FTAG']

# Load only those columns from the staged CSV
df = pd.read_csv('/content/football_data_combined.csv', usecols=columns_to_extract)
df.head()

Unnamed: 0,Div,Date,Time,HomeTeam,AwayTeam,FTHG,FTAG
0,E0,09/08/2019,20:00,Liverpool,Norwich,4,1
1,E0,10/08/2019,12:30,West Ham,Man City,0,5
2,E0,10/08/2019,15:00,Bournemouth,Sheffield United,1,1
3,E0,10/08/2019,15:00,Burnley,Southampton,3,0
4,E0,10/08/2019,15:00,Crystal Palace,Everton,0,0


