___
# Step 3 - Process Data from Dirty to Clean
**Author: Alexandru Nitulescu**
___
### Table of Contents
* [Abstract](#section-one)
* [Cleaning the Dataset](#section-two)
    - [Importing required packages](#subsection-one)
    - [Preview the Dataset](#subsection-two)
    - [Summary](#subsection-three)
* [Database and normalization](#section-three)
    - [Preview of the Dataframe](#subsection-four)
    - [Analyzing the Dataframe from a Database Relationship aspect](#subsection-five)
    - [Normalizing the dataset](#subsection-six)
* [Data integrity](#section-four)
* [Further reading](#section-five)

<a id="section-one"></a>
### Abstract
Once we have obtained the necessary data and addressed any questions from the preparation phase, we can proceed to the processing section. Here, our main focus is on cleaning and pre-processing the data to ensure its accuracy, completeness, and correctness. We will examine different methods for cleaning data and emphasize the significance of data integrity in the data analysis process.

Moreover, we will explore essential aspects of data management, such as data redundancy and database normalization. Adhering to the best practices for data management can ensure that the project solution is both time-efficient and reusable.

<a id="section-two"></a>
### Cleaning the Dataset

<a id="subsection-one"></a>
#### Importing required packages
To begin the data cleaning process, we need to ensure that we have all the necessary libraries imported. In this project, we will be working with pandas and numpy to manipulate data, so it's essential that we have these packages installed and ready to use.

In [353]:
import pandas as pd
import numpy as np

<a id="subsection-two"></a>
#### Preview the Dataset
Before cleaning the data, we need a clear understanding of the dataset's overall structure. This step involves a detailed examination of the data and some initial exploration. Key factors to keep in mind during the data preview are:

* Data completeness: Check for missing values in the dataset. If there are any, we should consider whether they matter for our analysis and decide early on whether to remove them or handle them with another method.

* Data consistency: Are the formats, data types and values consistent? For example, we need to ensure that categorical variables have consistent naming conventions, numeric variables are of the correct data type, and dates are formatted correctly.

* Data accuracy: We need to check if the data is accurate and free of errors, such as data entry errors, outliers or anomalies. This can be done through visual inspections, exploratory data analysis or statistical methods.

* Data relevance: We need to ensure that the data is relevant to our analysis and that we have included all the necessary variables. This can involve checking if there are any irrelevant or redundant variables that can be removed or transformed.

By examining these aspects during the data preview phase, we can detect any possible problems in the dataset early on and determine the most effective methods to clean and preprocess the data.
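The completeness and consistency checks listed above can be gathered into one small helper. This is a minimal sketch on a made-up three-row frame; the function name `preview_report` and the sample values are illustrative assumptions, not part of the project code.

```python
import pandas as pd

def preview_report(frame: pd.DataFrame) -> dict:
    """Summarize completeness and consistency signals for a dataframe."""
    return {
        "rows": len(frame),
        "missing_per_column": frame.isnull().sum().to_dict(),
        "duplicate_rows": int(frame.duplicated().sum()),
        "dtypes": frame.dtypes.astype(str).to_dict(),
    }

# Hypothetical sample data, only for demonstration
sample = pd.DataFrame({
    "MATCH UP": ["ATL vs. DAL", "CHA vs. TOR", "ATL vs. DAL"],
    "PTS": [132, 108, 132],
    "FG%": [47.2, None, 47.2],
})

report = preview_report(sample)
print(report)
```

A report like this gives a per-column view, whereas `df.isnull().values.any()` only answers whether at least one value is missing somewhere.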

In [354]:
# Read the csv file
df = pd.read_csv("./data/raw_data.csv", sep=";", index_col=0)

In [355]:
# Preview the dataframe
df.head()

Unnamed: 0,MATCH UP,GAME DATE,W/L,MIN,PTS,FGM,FGA,FG%,3PM,3PA,...,FT%,OREB,DREB,REB,AST,TOV,STL,BLK,PF,+/-
0,ATL vs. DAL,04/02/2023,W,53,132,51,108,47.2,12,35,...,81.8,16,37,53,28,11,10,3,22,2
1,CHA vs. TOR,04/02/2023,L,48,108,42,85,49.4,15,31,...,69.2,10,27,37,26,18,3,4,11,-20
2,PHI @ MIL,04/02/2023,L,48,104,40,87,46.0,12,36,...,92.3,11,25,36,19,11,3,2,17,-13
3,POR @ MIN,04/02/2023,W,48,107,43,93,46.2,9,30,...,60.0,11,31,42,29,10,12,3,26,2
4,MIL vs. PHI,04/02/2023,W,48,117,46,80,57.5,10,28,...,71.4,7,35,42,28,12,8,5,17,13


In [356]:
# Check for null values
df.isnull().values.any()

False

In [357]:
# Check for row duplicates
df.duplicated().values.any()

False

In [358]:
# Show the data types of the dataframe
df.dtypes

MATCH UP      object
GAME DATE     object
W/L           object
MIN            int64
PTS            int64
FGM            int64
FGA            int64
FG%          float64
3PM            int64
3PA            int64
3P%          float64
FTM            int64
FTA            int64
FT%          float64
OREB           int64
DREB           int64
REB            int64
AST            int64
TOV            int64
STL            int64
BLK            int64
PF             int64
+/-            int64
dtype: object

In [359]:
# Display describe() function output with two decimals
df.describe().round(2)

Unnamed: 0,MIN,PTS,FGM,FGA,FG%,3PM,3PA,3P%,FTM,FTA,FT%,OREB,DREB,REB,AST,TOV,STL,BLK,PF,+/-
count,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0,2350.0
mean,48.37,114.63,41.92,88.2,47.66,12.32,34.14,35.94,18.47,23.61,78.24,10.43,32.95,43.38,25.23,14.11,7.29,4.64,20.07,0.0
std,1.46,11.86,5.05,7.29,5.47,3.94,6.98,8.49,5.77,6.83,9.73,3.99,5.37,6.74,4.87,3.93,2.86,2.45,4.05,13.8
min,48.0,80.0,28.0,64.0,30.4,2.0,15.0,10.3,2.0,4.0,30.0,0.0,16.0,23.0,12.0,2.0,0.0,0.0,8.0,-45.0
25%,48.0,107.0,39.0,83.0,43.8,10.0,29.0,30.0,14.0,19.0,72.2,8.0,29.0,39.0,22.0,11.0,5.0,3.0,17.0,-9.0
50%,48.0,114.0,42.0,88.0,47.5,12.0,34.0,35.7,18.0,23.0,78.85,10.0,33.0,43.0,25.0,14.0,7.0,4.0,20.0,0.0
75%,48.0,122.0,45.0,93.0,51.2,15.0,39.0,41.88,22.0,28.0,85.0,13.0,37.0,48.0,28.0,17.0,9.0,6.0,23.0,9.0
max,58.0,176.0,65.0,121.0,65.5,27.0,61.0,63.6,40.0,51.0,100.0,29.0,60.0,73.0,44.0,28.0,18.0,19.0,35.0,45.0
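Beyond summary statistics, data accuracy can also be probed with simple arithmetic relationships between columns. The sketch below assumes that `REB` is the sum of `OREB` and `DREB` and that `FG%` is `FGM / FGA` expressed as a percentage rounded to one decimal. The values are copied from the first two preview rows, and the tolerance of 0.06 is an assumption chosen to allow for rounding.

```python
import pandas as pd

# Values taken from the first two rows of the preview above
games = pd.DataFrame({
    "FGM": [51, 42], "FGA": [108, 85], "FG%": [47.2, 49.4],
    "OREB": [16, 10], "DREB": [37, 27], "REB": [53, 37],
})

# Total rebounds should equal offensive plus defensive rebounds
reb_ok = games["REB"].eq(games["OREB"] + games["DREB"])

# Field-goal percentage should match makes / attempts within rounding tolerance
fg_ok = (100 * games["FGM"] / games["FGA"] - games["FG%"]).abs().le(0.06)

# Rows that fail either check deserve a closer look
suspect_rows = games[~(reb_ok & fg_ok)]
print(len(suspect_rows))
```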


<a id="subsection-three"></a>
#### Summary
After previewing the dataset and considering data completeness, consistency, accuracy, and relevance, I have decided to make the following changes:

1. Drop the unwanted column(s) and rename the remaining columns to make them more readable and consistent. The column names will be lowercased and any spaces replaced by an underscore (`_`), which will later make it easier to write SQL queries.

2. Extract the team abbreviation from the first three letters of the `MATCH UP` column and store it in a new column named `team_id`. This will allow us to easily identify which team the row is describing and simplify our analysis of the data.

3. Create a new column named `match_id` that combines the date and the home team's abbreviation. Both rows of a game (one per team) share the same `match_id`, so this column alone cannot be the primary key of the table. Instead, the primary key will consist of `match_id` and `team_id` combined, which we will define later in the process. Creating unique identifiers is an important aspect of data management, as it ensures each row is easily identifiable and can be used for analysis and modeling.

**In SQL, a key is a field or combination of fields that uniquely identifies each row in a table. Keys are used to enforce data integrity and enable efficient querying of data.**

4. To ensure data integrity, we need to change the data type of the `GAME DATE` column to match the corresponding data. After this change, all data integrity requirements will have been met.
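The key design described in point 3 can be checked mechanically. The following is a minimal sketch on hypothetical rows in the target layout (two rows per match); the frame and variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows in the target layout: two rows per match, one per team
rows = pd.DataFrame({
    "match_id": ["04022023MIL", "04022023MIL", "04022023ATL", "04022023ATL"],
    "team_id": ["PHI", "MIL", "ATL", "DAL"],
})

# match_id alone repeats, so it cannot serve as the primary key
match_id_unique = rows["match_id"].is_unique

# The pair (match_id, team_id) should never repeat
pair_unique = not rows.duplicated(subset=["match_id", "team_id"]).any()

# Each match should appear exactly twice (home row and away row)
two_rows_per_match = bool(rows.groupby("match_id").size().eq(2).all())

print(match_id_unique, pair_unique, two_rows_per_match)
```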

In [360]:
# Show the dataframe's columns
df.columns

Index(['MATCH UP', 'GAME DATE', 'W/L', 'MIN', 'PTS', 'FGM', 'FGA', 'FG%',
       '3PM', '3PA', '3P%', 'FTM', 'FTA', 'FT%', 'OREB', 'DREB', 'REB', 'AST',
       'TOV', 'STL', 'BLK', 'PF', '+/-'],
      dtype='object')

In [361]:
# Drop unwanted columns from the dataframe
df = df.drop(["+/-"], axis=1)

In [362]:
# The new column names
new_cols = ['match_up', 'game_date', 'result', 'min', 'pts', 'fgm',
            'fga', 'fgp', 'tpm', 'tpa', 'tpp', 'ftm', 'fta','ftp','oreb',
            'dreb', 'reb','ast', 'tov', 'stl', 'blk', 'pf']

In [363]:
# We map the old columns to the new ones
rename_cols = dict(zip(df.columns, new_cols))

This line of code creates a dictionary `rename_cols` where the keys are the current column names in the DataFrame `df`, and the values are the new column names in the list `new_cols`.

The `zip()` function pairs up the current column names with the new column names based on their position in the list.

The `dict()` function creates a dictionary from the resulting pairs of key-value tuples.

Finally, the `rename()` method is called on the DataFrame `df` with the columns parameter set to `rename_cols` to map the old column names to the new ones.
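The same `zip()`, `dict()` and `rename()` steps can be seen in isolation on a tiny frame. This is a sketch with a hypothetical two-column dataframe; the length check is an added safeguard, since `zip()` stops at the shorter input and would otherwise leave trailing columns unrenamed without any error.

```python
import pandas as pd

# Hypothetical two-column frame standing in for the raw data
demo = pd.DataFrame({"MATCH UP": ["ATL vs. DAL"], "FG%": [47.2]})
new_names = ["match_up", "fgp"]

# zip() stops at the shorter input, so guard against a silent mismatch
assert len(new_names) == len(demo.columns)

# Pair old and new names by position, then apply the mapping
mapping = dict(zip(demo.columns, new_names))
demo = demo.rename(columns=mapping)
print(list(demo.columns))
```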

In [364]:
print(rename_cols)

{'MATCH UP': 'match_up', 'GAME DATE': 'game_date', 'W/L': 'result', 'MIN': 'min', 'PTS': 'pts', 'FGM': 'fgm', 'FGA': 'fga', 'FG%': 'fgp', '3PM': 'tpm', '3PA': 'tpa', '3P%': 'tpp', 'FTM': 'ftm', 'FTA': 'fta', 'FT%': 'ftp', 'OREB': 'oreb', 'DREB': 'dreb', 'REB': 'reb', 'AST': 'ast', 'TOV': 'tov', 'STL': 'stl', 'BLK': 'blk', 'PF': 'pf'}


In [365]:
# Apply the mapper
df = df.rename(rename_cols, axis=1)

In [366]:
# Preview the dataframe at this point
df.head()

Unnamed: 0,match_up,game_date,result,min,pts,fgm,fga,fgp,tpm,tpa,...,fta,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf
0,ATL vs. DAL,04/02/2023,W,53,132,51,108,47.2,12,35,...,22,81.8,16,37,53,28,11,10,3,22
1,CHA vs. TOR,04/02/2023,L,48,108,42,85,49.4,15,31,...,13,69.2,10,27,37,26,18,3,4,11
2,PHI @ MIL,04/02/2023,L,48,104,40,87,46.0,12,36,...,13,92.3,11,25,36,19,11,3,2,17
3,POR @ MIN,04/02/2023,W,48,107,43,93,46.2,9,30,...,20,60.0,11,31,42,29,10,12,3,26
4,MIL vs. PHI,04/02/2023,W,48,117,46,80,57.5,10,28,...,21,71.4,7,35,42,28,12,8,5,17


In [367]:
# Extract the first three letters from match_up and append it to the empty list "teams"
teams = []
for i in df['match_up']:
    teams.append(i[:3])

In [368]:
# The list is stored in a new column called "team_id"
df['team_id'] = teams
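As an alternative to the explicit loop, pandas string accessors can slice every value at once. The sketch below uses a hypothetical two-row frame and should give the same `team_id` values as the loop, assuming every match-up string starts with a three-letter abbreviation.

```python
import pandas as pd

# Hypothetical frame with the same match_up format as the dataset
demo = pd.DataFrame({"match_up": ["ATL vs. DAL", "PHI @ MIL"]})

# .str[:3] slices the first three characters of each string
demo["team_id"] = demo["match_up"].str[:3]
print(demo)
```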

In [369]:
df.head()

Unnamed: 0,match_up,game_date,result,min,pts,fgm,fga,fgp,tpm,tpa,...,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf,team_id
0,ATL vs. DAL,04/02/2023,W,53,132,51,108,47.2,12,35,...,81.8,16,37,53,28,11,10,3,22,ATL
1,CHA vs. TOR,04/02/2023,L,48,108,42,85,49.4,15,31,...,69.2,10,27,37,26,18,3,4,11,CHA
2,PHI @ MIL,04/02/2023,L,48,104,40,87,46.0,12,36,...,92.3,11,25,36,19,11,3,2,17,PHI
3,POR @ MIN,04/02/2023,W,48,107,43,93,46.2,9,30,...,60.0,11,31,42,29,10,12,3,26,POR
4,MIL vs. PHI,04/02/2023,W,48,117,46,80,57.5,10,28,...,71.4,7,35,42,28,12,8,5,17,MIL


In [370]:
def clean_matchup(value):
    """
    This function cleans up the match-up column by splitting it on 'vs.' or '@'
    and reversing the order of the teams if necessary. It also removes any
    spaces in the resulting string.
    """
    parts = value.split('vs.') if 'vs.' in value else value.split('@')
    if 'vs.' in value:
        return '-'.join(parts[::-1]).replace(' ', '')
    else:
        return '-'.join(parts).replace(' ', '')
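A short usage sketch shows what the function returns for both match-up formats. The function is repeated here only so that the snippet runs on its own; the variable names are illustrative.

```python
def clean_matchup(value):
    """Return the match-up as 'AWAY-HOME' without spaces."""
    parts = value.split('vs.') if 'vs.' in value else value.split('@')
    if 'vs.' in value:
        # "HOME vs. AWAY" is reversed so the home team comes last
        return '-'.join(parts[::-1]).replace(' ', '')
    # "AWAY @ HOME" already has the home team last
    return '-'.join(parts).replace(' ', '')

home_row = clean_matchup("MIL vs. PHI")  # row describing the home team
away_row = clean_matchup("PHI @ MIL")    # row describing the away team
print(home_row, away_row)
```

Both rows of the same game end up with an identical string in which the home team comes last, which is what later allows the home abbreviation to be read from the final part of `match_up`.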

In [371]:
# Apply the 'clean_matchup' to the 'match_up' column
df['match_up'] = df['match_up'].apply(clean_matchup)

In [372]:
# Iterate over the values in game_date and replace forward slashes with an empty string
game_dates = []
for i in df['game_date']:
    game_dates.append(i.replace('/', ''))

In [373]:
# Column "match_id" is created by combining the list "game_dates" with the home team's abbreviation extracted from the "match_up" column
df['match_id'] = game_dates + df['match_up'].str.split('-').str[-1]

In [374]:
df.head()

Unnamed: 0,match_up,game_date,result,min,pts,fgm,fga,fgp,tpm,tpa,...,oreb,dreb,reb,ast,tov,stl,blk,pf,team_id,match_id
0,DAL-ATL,04/02/2023,W,53,132,51,108,47.2,12,35,...,16,37,53,28,11,10,3,22,ATL,04022023ATL
1,TOR-CHA,04/02/2023,L,48,108,42,85,49.4,15,31,...,10,27,37,26,18,3,4,11,CHA,04022023CHA
2,PHI-MIL,04/02/2023,L,48,104,40,87,46.0,12,36,...,11,25,36,19,11,3,2,17,PHI,04022023MIL
3,POR-MIN,04/02/2023,W,48,107,43,93,46.2,9,30,...,11,31,42,29,10,12,3,26,POR,04022023MIN
4,PHI-MIL,04/02/2023,W,48,117,46,80,57.5,10,28,...,7,35,42,28,12,8,5,17,MIL,04022023MIL


In [375]:
# Replace forward slashes (/) in each cell for the "game_date" column with (-)
df['game_date'] = df['game_date'].apply(lambda x: x.replace('/', '-'))

In [376]:
# Convert the "game_date" column to a datetime format
df['game_date'] = pd.to_datetime(df['game_date'])
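Letting `pd.to_datetime` infer the layout works here, but stating the format explicitly removes any ambiguity between month-first and day-first dates. The sketch below assumes the source uses month/day/year, which is consistent with `04/02/2023` becoming `2023-04-02` in the preview; with an explicit format the slash replacement is not needed.

```python
import pandas as pd

# Hypothetical raw values in the same layout as the game_date column
raw_dates = pd.Series(["04/02/2023", "10/18/2022"])

# Assumption: month/day/year, stated explicitly instead of inferred
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y")
print(parsed)
```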

In [377]:
df.head()

Unnamed: 0,match_up,game_date,result,min,pts,fgm,fga,fgp,tpm,tpa,...,oreb,dreb,reb,ast,tov,stl,blk,pf,team_id,match_id
0,DAL-ATL,2023-04-02,W,53,132,51,108,47.2,12,35,...,16,37,53,28,11,10,3,22,ATL,04022023ATL
1,TOR-CHA,2023-04-02,L,48,108,42,85,49.4,15,31,...,10,27,37,26,18,3,4,11,CHA,04022023CHA
2,PHI-MIL,2023-04-02,L,48,104,40,87,46.0,12,36,...,11,25,36,19,11,3,2,17,PHI,04022023MIL
3,POR-MIN,2023-04-02,W,48,107,43,93,46.2,9,30,...,11,31,42,29,10,12,3,26,POR,04022023MIN
4,PHI-MIL,2023-04-02,W,48,117,46,80,57.5,10,28,...,7,35,42,28,12,8,5,17,MIL,04022023MIL


In [378]:
# Reindex columns in the specified order
df = df.reindex(columns=['match_id','team_id','match_up', 'game_date', 'result', 'min', 'pts', 'fgm', 'fga', 'fgp',
       'tpm', 'tpa', 'tpp', 'ftm', 'fta', 'ftp', 'oreb', 'dreb', 'reb', 'ast',
       'tov', 'stl', 'blk', 'pf'])

In [379]:
df.head()

Unnamed: 0,match_id,team_id,match_up,game_date,result,min,pts,fgm,fga,fgp,...,fta,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf
0,04022023ATL,ATL,DAL-ATL,2023-04-02,W,53,132,51,108,47.2,...,22,81.8,16,37,53,28,11,10,3,22
1,04022023CHA,CHA,TOR-CHA,2023-04-02,L,48,108,42,85,49.4,...,13,69.2,10,27,37,26,18,3,4,11
2,04022023MIL,PHI,PHI-MIL,2023-04-02,L,48,104,40,87,46.0,...,13,92.3,11,25,36,19,11,3,2,17
3,04022023MIN,POR,POR-MIN,2023-04-02,W,48,107,43,93,46.2,...,20,60.0,11,31,42,29,10,12,3,26
4,04022023MIL,MIL,PHI-MIL,2023-04-02,W,48,117,46,80,57.5,...,21,71.4,7,35,42,28,12,8,5,17


<a id="section-three"></a>
### Database and normalization
Normalization is a process of organizing the data in a database to reduce data redundancy and improve data integrity. It involves breaking down a database into smaller, more manageable tables and establishing relationships between them. The main purpose of normalization is to eliminate redundant data and ensure that each piece of data is stored in only one place. This helps to minimize the possibility of data inconsistencies and anomalies that can occur when data is duplicated or stored in multiple locations.

Normalization is especially important when designing large databases that store a lot of information. Without normalization, data redundancy can quickly become a problem, making it difficult to maintain data consistency and accuracy. By breaking down a database into smaller, more manageable tables and establishing relationships between them, normalization helps to ensure that the data is organized in the most efficient and effective way possible.

There are several levels of normalization, each with its own set of rules and guidelines. The most common levels are first normal form (1NF), second normal form (2NF), and third normal form (3NF). The higher the level of normalization, the more complex the database design becomes, but the greater the benefits in terms of data consistency and accuracy.

* First normal form (1NF): This level of normalization requires that each column of a table contains only atomic values (values that cannot be divided any further). It also requires that each row is uniquely identifiable, usually through the use of a primary key.

* Second normal form (2NF): In addition to meeting the requirements of 1NF, this level of normalization requires that all non-key attributes (columns) in a table are fully dependent on the primary key. This means that a table should not contain any partial dependencies, where some non-key attributes depend on only part of the primary key.

* Third normal form (3NF): This level of normalization requires that all non-key attributes (columns) in a table are dependent only on the primary key and not on any other non-key attributes. This means that a table should not contain any transitive dependencies, where a non-key attribute depends on another non-key attribute instead of directly on the primary key.
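Partial dependencies can be spotted in a dataframe by checking whether an attribute is already determined by just one part of a composite key. The following is a rough sketch on hypothetical rows keyed by `(match_id, team_id)`; the helper `depends_only_on` is an illustrative name, not a pandas function.

```python
import pandas as pd

# Hypothetical rows keyed by (match_id, team_id)
stats = pd.DataFrame({
    "match_id": ["10182022GSW", "10182022GSW", "10182022BOS", "10182022BOS"],
    "team_id": ["GSW", "LAL", "PHI", "BOS"],
    "game_date": pd.to_datetime(["2022-10-18"] * 4),
    "pts": [123, 109, 117, 126],
})

def depends_only_on(frame, determinant, attribute):
    """True if every value of `determinant` maps to at most one `attribute` value."""
    return bool(frame.groupby(determinant)[attribute].nunique().le(1).all())

# game_date is fixed by match_id alone: a partial dependency (2NF issue)
date_is_partial = depends_only_on(stats, "match_id", "game_date")

# pts differs between the two teams of a match, so it needs the full key
pts_is_partial = depends_only_on(stats, "match_id", "pts")

print(date_is_partial, pts_is_partial)
```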

<a id="subsection-four"></a>
#### Preview of the Dataframe

In [380]:
df.head()

Unnamed: 0,match_id,team_id,match_up,game_date,result,min,pts,fgm,fga,fgp,...,fta,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf
0,04022023ATL,ATL,DAL-ATL,2023-04-02,W,53,132,51,108,47.2,...,22,81.8,16,37,53,28,11,10,3,22
1,04022023CHA,CHA,TOR-CHA,2023-04-02,L,48,108,42,85,49.4,...,13,69.2,10,27,37,26,18,3,4,11
2,04022023MIL,PHI,PHI-MIL,2023-04-02,L,48,104,40,87,46.0,...,13,92.3,11,25,36,19,11,3,2,17
3,04022023MIN,POR,POR-MIN,2023-04-02,W,48,107,43,93,46.2,...,20,60.0,11,31,42,29,10,12,3,26
4,04022023MIL,MIL,PHI-MIL,2023-04-02,W,48,117,46,80,57.5,...,21,71.4,7,35,42,28,12,8,5,17


<a id="subsection-five"></a>
#### Analyzing the Dataframe from a Database Relationship aspect
The first step is to identify each column, also known as an attribute, and group related columns together. This can be achieved by creating an entity-relationship (ER) diagram, which helps to visualize how the attributes and tables relate to each other. By grouping the columns together, we can better understand the relationships between tables and their cardinality. Cardinality matters because it determines how many instances of one entity can be associated with instances of another entity.

**1NF - First Normal Form**

From a database perspective, we can observe that there is duplicated data. Although no single column of "df", which we will from now on call the match_stats table, uniquely identifies a row, we can create a composite key by combining the "match_id" and "team_id" columns. Remember that there should always be two rows with the same match_id: one showing the home team's stats and one showing the away team's.
In SQL, a composite key is a combination of two or more columns that together uniquely identify a row. The table seems to conform to 1NF requirements since there are no repeating data groups and each attribute has a single value. Note that the original "MATCH UP" column name contained a space, which violates common naming conventions; this was resolved when the column was renamed to "match_up". Furthermore, the "match_up" column contains redundant information, since the home and away team abbreviations are already available through the "team_id" column of the two rows belonging to a match.

**2NF - Second Normal Form**

Upon closer inspection, it is evident that the table contains some redundant data. Specifically, the "game_date" attribute is repeated for each row that corresponds to the same match: it depends only on "match_id", which is just one part of the composite key. This partial dependency violates the second normal form (2NF), which requires that all non-key attributes be fully dependent on the whole primary key. To eliminate the redundancy and adhere to 2NF, a separate table for unique game dates can be created. This will also help maintain data consistency and accuracy, as any updates or changes to the date attribute will only need to be made in one place. The "game_dates" table can then be linked to the main table via a foreign key (date_id), through which the date can be retrieved.

**game_dates table**

| date_id      | date |
| ----------- | ----------- |
| 1      | 2022-10-18       |
| 2      | 2022-10-19       |
| ...    | ...              |

The match_results table is a result of further normalization from the match_stats table. This table stores the match results for each game played, including the match_id, team_id, date_id, result, minutes played, and total points scored. By separating the match results from the match stats, we reduce redundancy in the database and ensure data consistency and accuracy.

**match_results table**

| match_id | team_id | date_id | result | min | pts |
| ----------- | ----------- | ----------- | ----------- |----------- |----------- |
| 10182022GSW | GSW | 1 | W | 48 | 123|
| ... | ... | ... | ... | ... | ... |

In the match_results table, match_id and team_id together form the composite primary key, while team_id and date_id act as foreign keys referencing the team_info and game_dates tables. This table allows us to easily retrieve the result, the minutes played and the total points scored for each team in a particular match. Grouping this information into a single table makes the data easier to manage and maintain, and overall it helps us achieve a more efficient and organized database design. Notice also that the game date has been replaced by its date_id.

**team_info table**

| team_id | team_name | arena_name | latitude | longitude |
| ----------- | ----------- |----------- |----------- |----------- |
| GSW | Golden State Warriors | Chase Center | 37.768 |-122.3862 |
| ... | ... | ... | ... | ... |

The team_info table contains information about each team: team_id, team_name, arena_name, latitude and longitude. The team_id serves as the primary key for this table, uniquely identifying each team. I scraped this information, and it will be stored in a list variable.

After merging and normalizing the match_stats table, we should be left with the following table structure:

**match_stats table**

| match_id | team_id | fgm | fga | fgp | tpm | tpa | tpp | ftm | fta | ftp | oreb | dreb | reb | ast | tov | stl | blk | pf |
| ----------- | ----------- |----------- |----------- |----------- |----------- | ----------- |----------- |----------- |----------- | ----------- | ----------- |----------- |----------- |----------- | ----------- | ----------- |----------- |----------- |
| 10182022GSW | GSW | 45 | 99 | 45.5 | 16 | 45 | 35.6 | 17 | 23 | 73.9 | 11 | 37 | 48 | 31 | 18 | 11 | 4 | 23 |
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|

##### ER-diagram and Cardinality
* One-to-Many (1:N) between the "game_dates" and "match_results" tables, because a date can have many match results, but each match result belongs to one date.

* Many-to-One (N:1) between the "match_results" and "team_info" tables, because each match result is associated with a single team, but each team can have multiple match results over time.

* One-to-One (1:1) between the "match_results" and "match_stats" tables, because each match result row corresponds to exactly one row of match stats.

* One-to-Many (1:N) between the "match_stats" and "team_info" tables, because a team can have many match stats, but each match stat belongs to one team.
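The cardinalities listed above can be verified in pandas with the `validate` argument of `merge`, which raises a `MergeError` when the keys do not have the stated relationship. This is a minimal sketch on hypothetical miniature tables; the `fgm` value for LAL is made up for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
team_info = pd.DataFrame({"team_id": ["GSW", "LAL"],
                          "team_name": ["Golden State Warriors", "Los Angeles Lakers"]})
match_results = pd.DataFrame({"match_id": ["10182022GSW", "10182022GSW"],
                              "team_id": ["GSW", "LAL"], "pts": [123, 109]})
match_stats = pd.DataFrame({"match_id": ["10182022GSW", "10182022GSW"],
                            "team_id": ["GSW", "LAL"], "fgm": [45, 40]})

# 1:1 between match_results and match_stats on the composite key
one_to_one = match_results.merge(match_stats, on=["match_id", "team_id"],
                                 validate="one_to_one")

# N:1 from match_results to team_info
many_to_one = match_results.merge(team_info, on="team_id",
                                  validate="many_to_one")

print(len(one_to_one), len(many_to_one))
```

If `team_info` ever contained the same `team_id` twice, the second merge would fail instead of silently duplicating rows.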

![alt text](img/ER_diagram.PNG)

<a id="subsection-six"></a>
#### Normalizing the dataset

In [381]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2350 entries, 0 to 2349
Data columns (total 24 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   match_id   2350 non-null   object        
 1   team_id    2350 non-null   object        
 2   match_up   2350 non-null   object        
 3   game_date  2350 non-null   datetime64[ns]
 4   result     2350 non-null   object        
 5   min        2350 non-null   int64         
 6   pts        2350 non-null   int64         
 7   fgm        2350 non-null   int64         
 8   fga        2350 non-null   int64         
 9   fgp        2350 non-null   float64       
 10  tpm        2350 non-null   int64         
 11  tpa        2350 non-null   int64         
 12  tpp        2350 non-null   float64       
 13  ftm        2350 non-null   int64         
 14  fta        2350 non-null   int64         
 15  ftp        2350 non-null   float64       
 16  oreb       2350 non-null   int64         


In [382]:
# Save all the unique dates in an array
unique_dates = pd.unique(df['game_date'])

In [383]:
# Sort the dates in ascending order
unique_dates = np.sort(unique_dates)

In [384]:
# Check the data type of the variable; notice it is a numpy array
type(unique_dates)

numpy.ndarray

In [385]:
# Output the first unique date in "unique_dates"
unique_dates[0]

numpy.datetime64('2022-10-18T00:00:00.000000000')

In [386]:
# Create a dataframe with columns "date_id" and "game_date" 
df_gd = pd.DataFrame({'date_id': range(1,len(unique_dates)+1),
                      'game_date': unique_dates})
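A possible alternative, shown only as a sketch, is `pd.factorize` with `sort=True`, which produces the sorted unique dates and an integer code for every row in a single call. The sample dates and the names `date_ids` and `lookup` are hypothetical.

```python
import pandas as pd

# Hypothetical unsorted game dates, one entry per row of the main table
dates = pd.Series(pd.to_datetime(["2022-10-19", "2022-10-18", "2022-10-19"]),
                  name="game_date")

# codes index into the sorted uniques; add 1 so that ids start at 1
codes, uniques = pd.factorize(dates, sort=True)
date_ids = codes + 1

# Lookup table equivalent to a game_dates table
lookup = pd.DataFrame({"date_id": range(1, len(uniques) + 1),
                       "game_date": uniques})
print(date_ids)
print(lookup)
```

With this approach the `date_id` for each row is available directly, so a merge would only be needed to look the date back up.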

In [387]:
df_gd.head()

Unnamed: 0,date_id,game_date
0,1,2022-10-18
1,2,2022-10-19
2,3,2022-10-20
3,4,2022-10-21
4,5,2022-10-22


In [388]:
df.head()

Unnamed: 0,match_id,team_id,match_up,game_date,result,min,pts,fgm,fga,fgp,...,fta,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf
0,04022023ATL,ATL,DAL-ATL,2023-04-02,W,53,132,51,108,47.2,...,22,81.8,16,37,53,28,11,10,3,22
1,04022023CHA,CHA,TOR-CHA,2023-04-02,L,48,108,42,85,49.4,...,13,69.2,10,27,37,26,18,3,4,11
2,04022023MIL,PHI,PHI-MIL,2023-04-02,L,48,104,40,87,46.0,...,13,92.3,11,25,36,19,11,3,2,17
3,04022023MIN,POR,POR-MIN,2023-04-02,W,48,107,43,93,46.2,...,20,60.0,11,31,42,29,10,12,3,26
4,04022023MIL,MIL,PHI-MIL,2023-04-02,W,48,117,46,80,57.5,...,21,71.4,7,35,42,28,12,8,5,17


In [389]:
# Merge the two dataframes on the column "game_date" of df
df = pd.merge(df, df_gd, on='game_date')

In [390]:
df.head()

Unnamed: 0,match_id,team_id,match_up,game_date,result,min,pts,fgm,fga,fgp,...,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf,date_id
0,04022023ATL,ATL,DAL-ATL,2023-04-02,W,53,132,51,108,47.2,...,81.8,16,37,53,28,11,10,3,22,158
1,04022023CHA,CHA,TOR-CHA,2023-04-02,L,48,108,42,85,49.4,...,69.2,10,27,37,26,18,3,4,11,158
2,04022023MIL,PHI,PHI-MIL,2023-04-02,L,48,104,40,87,46.0,...,92.3,11,25,36,19,11,3,2,17,158
3,04022023MIN,POR,POR-MIN,2023-04-02,W,48,107,43,93,46.2,...,60.0,11,31,42,29,10,12,3,26,158
4,04022023MIL,MIL,PHI-MIL,2023-04-02,W,48,117,46,80,57.5,...,71.4,7,35,42,28,12,8,5,17,158


In [391]:
# Drop the column "game_date"
df = df.drop('game_date', axis=1)

In [392]:
# Create a sub dataframe with specific columns from dataframe "df"
df_mr = df[['match_id', 'team_id', 'date_id', 'result','min', 'pts']]

In [393]:
df_mr.head()

Unnamed: 0,match_id,team_id,date_id,result,min,pts
0,04022023ATL,ATL,158,W,53,132
1,04022023CHA,CHA,158,L,48,108
2,04022023MIL,PHI,158,L,48,104
3,04022023MIN,POR,158,W,48,107
4,04022023MIL,MIL,158,W,48,117


In [394]:
# Reverse the order of rows
df_mr = df_mr.iloc[::-1]

In [395]:
# Reset the index
# The previous cell and this one can also be combined in one line --> df_mr = df_mr[::-1].reset_index(drop=True)
df_mr = df_mr.reset_index(drop=True)

In [396]:
df_mr.head()

Unnamed: 0,match_id,team_id,date_id,result,min,pts
0,10182022GSW,GSW,1,W,48,123
1,10182022GSW,LAL,1,L,48,109
2,10182022BOS,PHI,1,L,48,117
3,10182022BOS,BOS,1,W,48,126
4,10192022SAS,CHA,2,W,48,129


In [397]:
df.head()

Unnamed: 0,match_id,team_id,match_up,result,min,pts,fgm,fga,fgp,tpm,...,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf,date_id
0,04022023ATL,ATL,DAL-ATL,W,53,132,51,108,47.2,12,...,81.8,16,37,53,28,11,10,3,22,158
1,04022023CHA,CHA,TOR-CHA,L,48,108,42,85,49.4,15,...,69.2,10,27,37,26,18,3,4,11,158
2,04022023MIL,PHI,PHI-MIL,L,48,104,40,87,46.0,12,...,92.3,11,25,36,19,11,3,2,17,158
3,04022023MIN,POR,POR-MIN,W,48,107,43,93,46.2,9,...,60.0,11,31,42,29,10,12,3,26,158
4,04022023MIL,MIL,PHI-MIL,W,48,117,46,80,57.5,10,...,71.4,7,35,42,28,12,8,5,17,158


In [398]:
# Drop "match_up" column 
df = df.drop('match_up', axis=1)

In [399]:
# Preview the columns we have by now 
df.columns

Index(['match_id', 'team_id', 'result', 'min', 'pts', 'fgm', 'fga', 'fgp',
       'tpm', 'tpa', 'tpp', 'ftm', 'fta', 'ftp', 'oreb', 'dreb', 'reb', 'ast',
       'tov', 'stl', 'blk', 'pf', 'date_id'],
      dtype='object')

In [400]:
# Rearrange the columns
df = df.reindex(columns=['match_id', 'team_id','date_id', 'fgm', 'fga', 'fgp', 'tpm',
       'tpa', 'tpp', 'ftm', 'fta', 'ftp', 'oreb', 'dreb', 'reb', 'ast', 'tov',
       'stl', 'blk', 'pf'])

In [401]:
df.head()

Unnamed: 0,match_id,team_id,date_id,fgm,fga,fgp,tpm,tpa,tpp,ftm,fta,ftp,oreb,dreb,reb,ast,tov,stl,blk,pf
0,04022023ATL,ATL,158,51,108,47.2,12,35,34.3,18,22,81.8,16,37,53,28,11,10,3,22
1,04022023CHA,CHA,158,42,85,49.4,15,31,48.4,9,13,69.2,10,27,37,26,18,3,4,11
2,04022023MIL,PHI,158,40,87,46.0,12,36,33.3,12,13,92.3,11,25,36,19,11,3,2,17
3,04022023MIN,POR,158,43,93,46.2,9,30,30.0,12,20,60.0,11,31,42,29,10,12,3,26
4,04022023MIL,MIL,158,46,80,57.5,10,28,35.7,15,21,71.4,7,35,42,28,12,8,5,17


In [402]:
team_info = [
    ('ATL','Atlanta Hawks', 'State Farm Arena', 33.7573, -84.3963),
    ("BOS",'Boston Celtics', 'TD Garden', 42.3662, -71.0621),
    ("BKN",'Brooklyn Nets', 'Barclays Center', 40.6826, -73.9754),
    ("CHA",'Charlotte Hornets', 'Spectrum Center', 35.2251, -80.8392),
    ("CHI",'Chicago Bulls', 'United Center', 41.8807, -87.6742),
    ("CLE",'Cleveland Cavaliers', 'Rocket Mortgage FieldHouse', 41.4965, -81.688),
    ("DAL",'Dallas Mavericks', 'American Airlines Center', 32.7906, -96.8101),
    ("DEN",'Denver Nuggets', 'Ball Arena', 39.7487, -105.0077),
    ("DET",'Detroit Pistons', 'Little Caesars Arena', 42.3426, -83.0554),
    ("GSW",'Golden State Warriors', 'Chase Center', 37.768, -122.3862),
    ("HOU",'Houston Rockets', 'Toyota Center', 29.7508, -95.3621),
    ("IND",'Indiana Pacers', 'Bankers Life Fieldhouse', 39.7639, -86.1555),
    ("LAC",'Los Angeles Clippers', 'Staples Center', 34.043, -118.2673),
    ("LAL",'Los Angeles Lakers', 'Staples Center', 34.043, -118.2673),
    ("MEM",'Memphis Grizzlies', 'FedExForum', 35.1381, -90.0507),
    ("MIA",'Miami Heat', 'AmericanAirlines Arena', 25.7814, -80.187),
    ("MIL",'Milwaukee Bucks', 'Fiserv Forum', 43.0451, -87.9173),
    ("MIN",'Minnesota Timberwolves', 'Target Center', 44.9795, -93.2768),
    ("NOP",'New Orleans Pelicans', 'Smoothie King Center', 29.9489, -90.0812),
    ("NYK",'New York Knicks', 'Madison Square Garden', 40.7505, -73.9934),
    ("OKC",'Oklahoma City Thunder', 'Paycom Center', 35.4634, -97.5151),
    ("ORL",'Orlando Magic', 'Amway Center', 28.5392, -81.3839),
    ("PHI",'Philadelphia 76ers', 'Wells Fargo Center', 39.9012, -75.1719),
    ("PHO",'Phoenix Suns', 'Footprint Center', 33.4457, -112.0712),
    ("POR",'Portland Trail Blazers', 'Moda Center', 45.5316, -122.666),
    ("SAC",'Sacramento Kings', 'Golden 1 Center', 38.5802, -121.4991),
    ("SAS",'San Antonio Spurs', 'AT&T Center', 29.4271, -98.4375),
    ("TOR",'Toronto Raptors', 'Scotiabank Arena', 43.6435, -79.3791),
    ("UTA",'Utah Jazz', 'Vivint Arena', 40.7683, -111.9011),
    ("WAS",'Washington Wizards', 'Capital One Arena', 38.898, -77.0209)
]

In [403]:
df_ti = pd.DataFrame(team_info, columns=['team_id', 'team_name', 'arena_name', 'latitude', 'longitude'])

In [404]:
df_ti.head()

Unnamed: 0,team_id,team_name,arena_name,latitude,longitude
0,ATL,Atlanta Hawks,State Farm Arena,33.7573,-84.3963
1,BOS,Boston Celtics,TD Garden,42.3662,-71.0621
2,BKN,Brooklyn Nets,Barclays Center,40.6826,-73.9754
3,CHA,Charlotte Hornets,Spectrum Center,35.2251,-80.8392
4,CHI,Chicago Bulls,United Center,41.8807,-87.6742


<a id="section-four"></a>
### Data integrity
We will examine each dataframe and make any necessary changes to ensure that the columns have the correct datatypes. For instance, if a column containing dates is stored as a string, we will convert it to a datetime datatype. Similarly, if a numerical column is stored as a string, we will convert it to an integer or float datatype as appropriate.

By performing this data integrity check, we can be confident that our data is clean and ready for analysis. This will help ensure the accuracy of our results and prevent any potential issues that could arise from incorrect datatypes. Additionally, ensuring that the data types are correct is important for storing the data in a database and running queries. Databases have specific data types, and if the data types in the dataframe are not aligned with those in the database, it can cause errors or unexpected results. By verifying the data types before storing the data in a database, we can ensure that the data is accurately represented and avoid potential issues when running queries later on.
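Instead of reading the `dtypes` output by eye, the expected types can be written down once and checked automatically. The following is a minimal sketch using a hypothetical two-row frame in which the dates are deliberately left as strings; the helper `dtype_problems` and the `expected` mapping are illustrative assumptions.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical frame; the dates are left as strings on purpose
game_dates = pd.DataFrame({
    "date_id": [1, 2],
    "game_date": ["2022-10-18", "2022-10-19"],
})

# Expected type check per column
expected = {
    "date_id": ptypes.is_integer_dtype,
    "game_date": ptypes.is_datetime64_any_dtype,
}

def dtype_problems(frame, checks):
    """Return the columns whose dtype fails its expected check."""
    return [col for col, check in checks.items() if not check(frame[col])]

before = dtype_problems(game_dates, expected)

# Fix the offending column and check again
game_dates["game_date"] = pd.to_datetime(game_dates["game_date"])
after = dtype_problems(game_dates, expected)

print(before, after)
```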

In [405]:
df.dtypes

match_id     object
team_id      object
date_id       int64
fgm           int64
fga           int64
fgp         float64
tpm           int64
tpa           int64
tpp         float64
ftm           int64
fta           int64
ftp         float64
oreb          int64
dreb          int64
reb           int64
ast           int64
tov           int64
stl           int64
blk           int64
pf            int64
dtype: object

In [406]:
df_ti.dtypes

team_id        object
team_name      object
arena_name     object
latitude      float64
longitude     float64
dtype: object

In [407]:
df_mr.dtypes

match_id    object
team_id     object
date_id      int64
result      object
min          int64
pts          int64
dtype: object

In [408]:
df_gd.dtypes

date_id               int64
game_date    datetime64[ns]
dtype: object
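Correct data types are one part of data integrity; another is referential integrity, meaning that every foreign key value has a matching row in the table it references. The sketch below uses hypothetical frames in which `"XYZ"` is a deliberately unknown abbreviation. The same pattern could be applied to `df` and `df_ti` to catch, for instance, a team that the source abbreviates differently from the `team_info` list.

```python
import pandas as pd

# Hypothetical frames; "XYZ" is a deliberately unknown abbreviation
stats = pd.DataFrame({"match_id": ["m1", "m1", "m2"],
                      "team_id": ["ATL", "DAL", "XYZ"]})
teams = pd.DataFrame({"team_id": ["ATL", "DAL"]})

# team_id values in stats that have no matching row in teams
orphans = stats.loc[~stats["team_id"].isin(teams["team_id"]), "team_id"].unique()
print(list(orphans))
```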

In [409]:
# Save our dataframes to csv
# df.to_csv('./data/match_stats.csv', sep=";", index=False)
# df_gd.to_csv('./data/game_dates.csv', sep=";", index=False)
# df_mr.to_csv('./data/match_info.csv', sep=";", index=False)
# df_ti.to_csv('./data/team_info.csv', sep=";", index=False)
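One caveat when saving to CSV is that the format stores no type information, so a datetime column is read back as plain text unless it is parsed again. The following sketch demonstrates this with an in-memory buffer instead of a file; the frame and variable names are hypothetical.

```python
import io
import pandas as pd

# Hypothetical miniature game_dates table
game_dates = pd.DataFrame({
    "date_id": [1, 2],
    "game_date": pd.to_datetime(["2022-10-18", "2022-10-19"]),
})

# Write to an in-memory buffer with the same separator as the project files
buffer = io.StringIO()
game_dates.to_csv(buffer, sep=";", index=False)

# Reading without parse_dates returns the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer, sep=";")

# parse_dates restores the datetime type
buffer.seek(0)
restored = pd.read_csv(buffer, sep=";", parse_dates=["game_date"])

print(plain.dtypes)
print(restored.dtypes)
```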

<a id="section-five"></a>
### Further reading
* https://pandas.pydata.org/docs/user_guide/index.html#user-guide - Documentation for pandas.
* https://numpy.org/doc/ - Documentation for numpy.
* https://www.oracle.com/database/what-is-a-relational-database/ - Good description of relational databases.
* https://www.freecodecamp.org/news/database-normalization-1nf-2nf-3nf-table-examples/ - Article about database normalization.