# EDA of the Grammy Awards dataset

### This EDA cleans and analyzes the 'the_grammy_awards' dataset. We obtained it from Kaggle and stored it locally in a PostgreSQL database.

Import the necessary libraries.

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sqlalchemy import create_engine, inspect
from dotenv import load_dotenv
import os


Connect to the database.

In [27]:
load_dotenv()

localhost = os.getenv('LOCALHOST')
port = os.getenv('PORT')
nameDB = os.getenv('DB_NAME')
userDB = os.getenv('DB_USER')
passDB = os.getenv('DB_PASS')

try:
    engine = create_engine(f'postgresql+psycopg2://{userDB}:{passDB}@{localhost}:{port}/{nameDB}')
    inspector = inspect(engine)

    # Open a short-lived connection to verify the credentials;
    # the context manager closes it even if an error occurs.
    with engine.connect():
        print("Successfully connected to the database.")

except Exception as e:
    print(f"Failed to connect to the database: {e}")

Successfully connected to the database.
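
The credentials read from `.env` are interpolated into the connection URL as-is, so a password containing characters such as `@` or `/` would produce a malformed URL. A minimal sketch, assuming a hypothetical helper `build_postgres_url` (not part of the original notebook) and made-up example values, that percent-encodes the user and password with the standard library first:

```python
from urllib.parse import quote_plus


def build_postgres_url(user, password, host, port, database):
    """Hypothetical helper: build a PostgreSQL URL with escaped credentials."""
    # quote_plus percent-encodes reserved characters such as '@', ':' and '/'.
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Example values only; the real ones would come from os.getenv as above.
url = build_postgres_url("grammy_user", "p@ss/word", "localhost", "5432", "grammys")
print(url)
```

The resulting string could be passed to `create_engine` in place of the f-string used in the cell above.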


We load the table from the database, then print its first rows and a summary to inspect the columns and their data types.

In [32]:
table_name = 'grammy_awards'

df = pd.read_sql_table(table_name, engine)

print(df.head())
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4810 entries, 0 to 4809
Data columns (total 10 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          4810 non-null   int64 
 1   title         4810 non-null   object
 2   published_at  4810 non-null   object
 3   updated_at    4810 non-null   object
 4   category      4810 non-null   object
 5   nominee       4804 non-null   object
 6   artist        2970 non-null   object
 7   workers       2620 non-null   object
 8   img           3443 non-null   object
 9   winner        4810 non-null   bool  
dtypes: bool(1), int64(1), object(8)
memory usage: 343.0+ KB
None


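Next, we check the data type of each column. Judging by the execution counts, this cell ran after the 'img' column was dropped in the next step, which is why 'img' is absent from the output.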

In [34]:
print(df.dtypes)

year             int64
title           object
published_at    object
updated_at      object
category        object
nominee         object
artist          object
workers         object
winner            bool
dtype: object


We drop the 'img' column: it contains only image links that we will not access, so it adds nothing to the analysis.

In [33]:
df = df.drop(columns="img")
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4810 entries, 0 to 4809
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          4810 non-null   int64 
 1   title         4810 non-null   object
 2   published_at  4810 non-null   object
 3   updated_at    4810 non-null   object
 4   category      4810 non-null   object
 5   nominee       4804 non-null   object
 6   artist        2970 non-null   object
 7   workers       2620 non-null   object
 8   winner        4810 non-null   bool  
dtypes: bool(1), int64(1), object(7)
memory usage: 305.5+ KB
None
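
Notebook cells are often re-run, and dropping a column that is already gone raises a `KeyError`. A small sketch on made-up data (the `toy` frame is an assumption for illustration, not the real table) showing how `errors="ignore"` makes the drop safe to repeat:

```python
import pandas as pd

# Toy frame standing in for the Grammy table (values are made up).
toy = pd.DataFrame({"year": [2019, 2019], "img": ["https://example.com/a.jpg", None]})

# errors="ignore" skips labels that are missing instead of raising KeyError.
toy = toy.drop(columns="img", errors="ignore")
toy = toy.drop(columns="img", errors="ignore")  # second run is a no-op

print(list(toy.columns))
```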


Now let's see how many null values each column has.

In [20]:
null_data = df.isnull().sum()
print(null_data)

year               0
title              0
published_at       0
updated_at         0
category           0
nominee            6
artist          1840
workers         2190
winner             0
dtype: int64
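
Raw counts are easier to judge next to the share of rows they represent: in the output above, 1840 of 4810 rows (about 38%) have no artist. A minimal sketch on a made-up frame (the `toy` data and the `summary` name are assumptions for illustration) that reports both the count and the percentage of nulls per column:

```python
import pandas as pd

# Toy frame with a few missing values (made-up data).
toy = pd.DataFrame({
    "nominee": ["A", "B", None, "D"],
    "artist": ["X", None, None, "W"],
    "winner": [True, True, True, True],
})

null_counts = toy.isnull().sum()
# The mean of a boolean mask is the fraction of True values, i.e. the null share.
null_share = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"nulls": null_counts, "percent": null_share})
print(summary)
```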


We drop the rows with a null value in the 'artist' column: the artist is essential to the analysis, and a row without one is of no use. Keeping this column free of nulls also makes for cleaner plots later on. As the output shows, this also removes the 6 rows with a null 'nominee'.

In [23]:
df = df.dropna(subset=["artist"])
null_data = df.isnull().sum()
print(null_data)

year               0
title              0
published_at       0
updated_at         0
category           0
nominee            0
artist             0
workers         2004
winner             0
dtype: int64
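
`dropna(subset=[...])` inspects only the listed columns, which is why 'workers' still has nulls afterwards, while the null 'nominee' values disappear only because those rows also lacked an artist. A minimal sketch on made-up data (the `toy` and `cleaned` names are assumptions for illustration) showing the same behavior:

```python
import pandas as pd

# Toy frame: row 1 lacks both nominee and artist, row 2 lacks only the artist.
toy = pd.DataFrame({
    "nominee": ["Song A", None, "Song C", "Song D"],
    "artist": ["X", None, None, "W"],
    "workers": [None, None, "Producer P", "Producer Q"],
})

# Only rows with a null 'artist' are removed; nulls in other columns are kept.
cleaned = toy.dropna(subset=["artist"])

print(len(toy), "->", len(cleaned))
print(cleaned.isnull().sum())
```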


In [25]:
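# Fill the remaining nulls in 'workers' with a placeholder. Assigning the result back
# avoids the chained in-place call, which newer pandas versions warn about.
df['workers'] = df['workers'].fillna('Unknown')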
print(df.info())

<class 'pandas.core.frame.DataFrame'>
Index: 2970 entries, 0 to 4803
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   year          2970 non-null   int64 
 1   title         2970 non-null   object
 2   published_at  2970 non-null   object
 3   updated_at    2970 non-null   object
 4   category      2970 non-null   object
 5   nominee       2970 non-null   object
 6   artist        2970 non-null   object
 7   workers       2970 non-null   object
 8   winner        2970 non-null   bool  
dtypes: bool(1), int64(1), object(7)
memory usage: 211.7+ KB
None
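
The summary above reports `Index: 2970 entries, 0 to 4803`, meaning the row labels still have the gaps left by the dropped rows. A minimal sketch on made-up data, assuming a contiguous index is wanted (the original notebook does not reset it), that fills the placeholder by assignment and then renumbers the rows:

```python
import pandas as pd

toy = pd.DataFrame(
    {"artist": ["X", "W"], "workers": [None, "Producer Q"]},
    index=[0, 3],  # gaps like those left by an earlier dropna
)

# Assign the filled column back instead of calling fillna(..., inplace=True) on it.
toy["workers"] = toy["workers"].fillna("Unknown")

# drop=True discards the old labels rather than keeping them as a new column.
toy = toy.reset_index(drop=True)
print(toy)
```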


In [None]:
# try:
#     df.to_sql('grammy_awards', engine, if_exists='replace', index=False)

#     print("Table 'grammy_awards' updated.")

# except Exception as e:
#     print(f"Error uploading data: {e}")

# finally:
#     engine.dispose()
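
The commented-out cell above would write the cleaned frame back with `if_exists='replace'`, which drops the existing table and recreates it rather than appending. A minimal sketch of that behavior, using an in-memory SQLite database as a stand-in for PostgreSQL and made-up rows (both are assumptions for illustration):

```python
import sqlite3

import pandas as pd

toy = pd.DataFrame({"year": [2019, 2018], "artist": ["X", "W"], "winner": [True, True]})

# In-memory SQLite stands in for the PostgreSQL database (assumption for illustration).
conn = sqlite3.connect(":memory:")
try:
    # First write creates the table with two rows.
    toy.to_sql("grammy_awards", conn, if_exists="replace", index=False)
    # Second write replaces the whole table, so only one row remains.
    toy.head(1).to_sql("grammy_awards", conn, if_exists="replace", index=False)
    row_count = pd.read_sql_query(
        "SELECT COUNT(*) AS n FROM grammy_awards", conn
    )["n"].iloc[0]
finally:
    conn.close()

print(row_count)
```

Because the replace is destructive, it may be worth keeping the write-back cell commented out, as the notebook does, until the cleaning steps are final.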