# Cleaning The Data

Made by: Alexander Beaucage

Date: June 23 2023

Contact Info: Beaucagealex202@gmail.com

The goal of this notebook is to clean the data set and get it ready for building a recommendation system.

In [None]:
# Imports
import numpy as np
import pandas as pd

In [None]:
# Path to the raw data file, so pandas can read it
datadir = r"../csv_files/dataset.csv"

# Reading in the data
data = pd.read_csv(datadir)

In [None]:
# Preview the data. Why are there extra columns?
data.head()
# I was expecting only four columns but got an extra 117.

In [None]:
# How many rows and columns?
data.shape
# 1048575 rows
# 121 columns (should only be 4)

There are two weird things going on with the data.

First, there are the 117 extra columns.

Second, there are quotes around the column names `artistname`, `trackname`, and `playlistname`.
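The odd header names that show up later in this notebook (for example `' "artistname"'`) start with a space before the quote, which hints that the raw file separates fields with a comma followed by a space. Assuming that is the case, here is a minimal sketch of how `skipinitialspace=True` in `pd.read_csv` lets pandas treat the quotes as quoting. The one-row sample is made up for illustration and is not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample that mimics the header style seen in this notebook
raw = (
    'user_id, "artistname", "trackname", "playlistname"\n'
    'u1, "Some Artist", "Some Track", "Some Playlist"\n'
)

# Default parsing keeps the leading space and the quotes in the names
default = pd.read_csv(io.StringIO(raw))
print(list(default.columns))

# skipinitialspace=True skips the space after each comma,
# so the quote is the first character of the field and is parsed as quoting
cleaned = pd.read_csv(io.StringIO(raw), skipinitialspace=True)
print(list(cleaned.columns))
```

This does not explain the extra columns by itself, so the steps below are still needed.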

Here is the plan of action for removing the extra columns:
- Find and drop the rows with anything in the extra 117 columns
  1. Create a boolean mask that marks whether a row has anything in the extra columns.
  2. Drop the rows flagged by that mask, or make a new data frame without them.
- Then drop the extra 117 columns
  1. Keep only the first 4 columns.
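
The plan above can be sketched on a tiny made-up frame. The column names `extra_1` and `extra_2` and all the values are hypothetical stand-ins for the unexpected columns.

```python
import numpy as np
import pandas as pd

# Toy frame: 2 expected columns plus 2 hypothetical extra columns
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "artistname": ["A", "B", "C"],
    "extra_1": [np.nan, "spill", np.nan],
    "extra_2": [np.nan, np.nan, np.nan],
})

# Step 1: the mask is True for rows with anything in the extra columns
has_extra = toy.iloc[:, 2:].notna().any(axis=1)

# Step 2: keep the rows without extra entries, then keep only the expected columns
toy_clean = toy.loc[~has_extra].iloc[:, :2]
print(toy_clean)
```

Only the second row has a value in an extra column, so it is the only row removed.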

In [None]:
# Create a boolean mask that is True when a row has no entries in the extra columns
# .all(axis=1) combines each row's booleans into a single value instead of an array of booleans
boolmask = data.iloc[:, 4:].isna().all(axis=1)

This seems backwards, but a value of `True` means that there is nothing in the extra columns for that row.

In [None]:
# Make sure the boolean mask looks right
boolmask
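
Printing a mask with about a million entries only shows the first and last few values. A quicker sanity check is to count the two outcomes. The sketch below uses a small hypothetical mask with the same orientation as `boolmask` (`True` means nothing in the extra columns).

```python
import pandas as pd

# Hypothetical mask: True = the row has nothing in the extra columns
mask = pd.Series([True, True, False, True])

# How many rows are clean (True) and how many will be dropped (False)
print(mask.value_counts())

# Inverting the mask flags the rows to drop, which may read more naturally
rows_to_drop = ~mask
print(int(rows_to_drop.sum()))
```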

In [None]:
# Remove the rows with data in the extra columns (where the mask is False)
data.drop(boolmask[~boolmask].index, inplace=True)

In [None]:
# Make data equal to the first 4 columns of data,
# effectively removing the extra columns.
# .copy() gives an independent frame, so later in-place edits do not act on a slice
data = data.iloc[:, :4].copy()

In [None]:
# Take a look at the data to make sure it looks right.
data.sample(5)

Now it's time to remove those weird quotes.

In [None]:
# See what the column names actually are.
data.columns

In [None]:
# Change the column names to be able to use them later.
data.rename(columns={' "artistname"': 'artistname',
                     ' "trackname"': 'trackname',
                     ' "playlistname"': 'playlistname'},
            inplace=True)

In [None]:
# Take a look to see if it worked
data.head()

In [None]:
# Try to check the artistname column
data["artistname"]
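
Typing each odd header by hand works fine for three columns. As an alternative sketch, `Index.str.strip` can remove the leading space and the quotes from every header at once. The empty frame below is a made-up stand-in with the same header style.

```python
import pandas as pd

# Stand-in frame with the same odd headers as the raw data
df = pd.DataFrame(columns=['user_id', ' "artistname"', ' "trackname"', ' "playlistname"'])

# Strip spaces and double quotes from both ends of every column name
df.columns = df.columns.str.strip(' "')
print(list(df.columns))
```

Names that are already clean, like `user_id`, pass through unchanged.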

Now it's time to see if there are nulls in the data.

In [None]:
# Get the info on the data
data.info()

Looks like there are some nulls in the data.

In [None]:
# Get the number of nulls in the user_id column
nulls_user_id = data["user_id"].isna().sum()
# Get the number of nulls in the artistname column
nulls_artist_name = data["artistname"].isna().sum()
# Get the number of nulls in the trackname column
nulls_track_name = data["trackname"].isna().sum()
# Get the number of nulls in the playlistname column
nulls_playlist_name = data["playlistname"].isna().sum()

# Print off some useful info
print(f"There are {nulls_user_id} nulls in user_id \n")
print(f"There are {nulls_artist_name} nulls in artistname \n")
print(f"There are {nulls_track_name} nulls in trackname \n")
print(f"There are {nulls_playlist_name} nulls in playlistname \n")
print(f"There are {nulls_user_id + nulls_artist_name + nulls_track_name + nulls_playlist_name} nulls all together")
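
The per-column counts can also be produced in one call with `DataFrame.isna().sum()`. A small sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame with a few missing values
toy = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "artistname": ["A", np.nan, np.nan],
    "trackname": ["x", np.nan, "z"],
})

# Nulls per column, as a Series indexed by column name
nulls_per_column = toy.isna().sum()
print(nulls_per_column)

# Total number of null cells across the whole frame
total_nulls = int(nulls_per_column.sum())
print(total_nulls)
```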

So, all together there are 2285 nulls in the data, and most of them are in the artist name column. It would be very difficult to impute the missing values (mostly time consuming). For that reason I will be dropping the rows with nulls in them.

In [None]:
# Drop all the rows with nulls
data.dropna(inplace = True)
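
A row with nulls in two columns adds two to the null total but is dropped only once, so the number of rows removed by `dropna` can be smaller than the total null count. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Made-up frame: the second row is missing both values
toy = pd.DataFrame({
    "artistname": ["A", np.nan, np.nan],
    "trackname": ["x", np.nan, "z"],
})

# Total null cells versus rows that contain at least one null
null_cells = int(toy.isna().sum().sum())
rows_with_nulls = int(toy.isna().any(axis=1).sum())

# dropna removes each affected row exactly once
cleaned = toy.dropna()
rows_dropped = len(toy) - len(cleaned)
print(null_cells, rows_with_nulls, rows_dropped)
```

Comparing `data.shape[0]` before and after the drop gives the same check on the real data.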

In [None]:
# Get the info on the data
data.info()

Nice and clean data!

Now to save the data as a .csv file.

In [None]:
# This is the file name for the cleaned data
dataname = r"cleandata.csv"

# This is the directory where I'm going to save the data
datadir = r"../csv_files/"

# Save the data to the directory specified
data.to_csv(datadir + dataname)
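
By default `to_csv` also writes the row index as an unnamed first column, and after dropping rows that index has gaps. Whether the index should be kept is a choice this notebook does not state. Assuming it is not needed downstream, here is a sketch of saving without it, using an in-memory buffer and made-up data instead of the real file.

```python
import io
import pandas as pd

# Made-up frame whose index has a gap, as it would after dropping rows
toy = pd.DataFrame({"user_id": ["u1", "u3"], "artistname": ["A", "C"]}, index=[0, 2])

# Default: the index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
cols_with_index = pd.read_csv(io.StringIO(with_index.getvalue())).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
cols_without_index = pd.read_csv(io.StringIO(without_index.getvalue())).columns.tolist()
print(cols_without_index)
```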