# Cleaning, Transforming and Storing your Data
<img src="https://github.com/Minyall/sc207_materials/blob/master/images/python_pandas.jpg?raw=true" align="right">
 Now we know a little about selecting, filtering and sorting our data, we can move on to cleaning and transforming our data too.

 Data is rarely perfect. Missing values, anomalous values and duplicates can cause all sorts of issues in analysis. It may also be that Pandas does not interpret the data correctly straight away, and that we need to tell it what kind of data it's looking at.

 Once we've got on top of that, we can also explore how Pandas' powerful summarisation tools can help us understand our data better.

[__Pandas Documentation__](http://pandas.pydata.org/pandas-docs/stable/)




In [24]:
import pandas as pd
filename = 'spotify_top_songs.csv'
songs_df = pd.read_csv(filename)
songs_df.head()

Unnamed: 0,track_id,track_name,artists,genre,release_year,release_date,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,5mjYQaktjmjcMKcUIcqz4s,Strangers,Kenya Grace,singer-songwriter pop,2023,2023-09-01,False,97,172964,Top 50 - United Kingdom,0.628,-8.307,,mixed_pop
1,56y1jOTK0XSvJzVv9vHQBK,Paint The Town Red,Doja Cat,dance pop,2023,2023-09-20,True,87,230480,Top 50 - United Kingdom,0.864,-7.683,0.194,mixed_pop
2,1reEeZH9wNt4z1ePYLyC7p,greedy,Tate McRae,alt z,2023,2023-09-13,True,31,131872,Top 50 - United Kingdom,0.75,-3.19,0.0322,mixed_pop
3,59NraMJsLaMCVtwXTSia8i,Prada,cassö,***OOPS!***,2023,2023-08-11,True,94,132359,Top 50 - United Kingdom,0.638,-5.804,0.0375,mixed_pop
4,5aIVCx5tnk0ntmdiinnYvw,Water,Tyla,***OOPS!***,2023,2023-07-28,False,91,200255,Top 50 - United Kingdom,0.673,-3.495,0.0755,mixed_pop


## Data Cleaning
Data cleaning can involve a range of techniques, but the unifying goal is to get your data into a state that is ready for analysis. This could include:
- Removing rows where data is missing
- Replacing missing data with another value.
- Replacing data that may be oddly formatted to make it more analysis compatible.
- Transforming the type of data in a column to correct mistakes or to make it more useful.


If we examine the output of `.info()` we can quickly identify whether there are missing values that Pandas knows about by comparing the total number of entries with the 'Non-Null Count' for each column.

### Dropping and filling missing data

In [25]:
songs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1280 entries, 0 to 1279
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       1280 non-null   object 
 1   track_name     1280 non-null   object 
 2   artists        1280 non-null   object 
 3   genre          1280 non-null   object 
 4   release_year   1280 non-null   int64  
 5   release_date   1280 non-null   object 
 6   explicit       1280 non-null   bool   
 7   popularity     1280 non-null   int64  
 8   duration_ms    1280 non-null   int64  
 9   playlist_name  1280 non-null   object 
 10  danceability   1280 non-null   float64
 11  loudness       1280 non-null   float64
 12  speechiness    1279 non-null   float64
 13  playlist_type  1280 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(7)
memory usage: 131.4+ KB


We can identify which row is missing its value for `speechiness` using a special filter called `.isna()`, which is `True` wherever a value is missing.

In [26]:
songs_df[songs_df['speechiness'].isna()]

Unnamed: 0,track_id,track_name,artists,genre,release_year,release_date,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
0,5mjYQaktjmjcMKcUIcqz4s,Strangers,Kenya Grace,singer-songwriter pop,2023,2023-09-01,False,97,172964,Top 50 - United Kingdom,0.628,-8.307,,mixed_pop
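As a complement to `.info()`, we can also count the missing values in each column directly. The sketch below uses a small made-up dataframe (`toy_df` is a hypothetical stand-in, not the real `songs_df`) to show the idea.

```python
import pandas as pd

# Hypothetical toy data standing in for songs_df
toy_df = pd.DataFrame({
    "track_name": ["song_a", "song_b", "song_c"],
    "speechiness": [None, 0.0375, 0.0755],
})

# .isna() marks every cell as True (missing) or False (present).
# Summing treats True as 1, so we get a count of missing values per column.
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

Running the same `.isna().sum()` on `songs_df` should give one count per column, which can be easier to scan than the full `.info()` report.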


There are multiple approaches to missing data, depending on your analysis. The simplest is to drop any row that has any missing data. `.dropna()` will do this for us, returning a version of the dataframe where every row has a value for every column.

If you only want to drop rows with a missing value in a specific column(s) you can use the `subset=` argument. You must pass it a list of column names, even when only checking one column.

We can see that the total number of rows is now one less than the original dataframe.

In [27]:
songs_df.dropna(subset=['speechiness']).info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1279 entries, 1 to 1279
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       1279 non-null   object 
 1   track_name     1279 non-null   object 
 2   artists        1279 non-null   object 
 3   genre          1279 non-null   object 
 4   release_year   1279 non-null   int64  
 5   release_date   1279 non-null   object 
 6   explicit       1279 non-null   bool   
 7   popularity     1279 non-null   int64  
 8   duration_ms    1279 non-null   int64  
 9   playlist_name  1279 non-null   object 
 10  danceability   1279 non-null   float64
 11  loudness       1279 non-null   float64
 12  speechiness    1279 non-null   float64
 13  playlist_type  1279 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(7)
memory usage: 141.1+ KB
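To make the difference between a plain `.dropna()` and `.dropna(subset=...)` concrete, here is a minimal sketch on a made-up dataframe. The names and values in `toy_df` are invented for illustration only.

```python
import pandas as pd

# Hypothetical toy data: row 0 is missing speechiness, row 1 is missing genre
toy_df = pd.DataFrame({
    "track_name": ["song_a", "song_b", "song_c"],
    "genre": ["dance pop", None, "alt z"],
    "speechiness": [None, 0.0375, 0.0755],
})

# No arguments: drop every row that has a missing value in ANY column
drop_any = toy_df.dropna()

# subset: only look at the listed columns when deciding what to drop
drop_speech = toy_df.dropna(subset=["speechiness"])

print(len(toy_df), len(drop_any), len(drop_speech))
```

Note that `toy_df` itself is unchanged in both cases, because `.dropna()` returns a new dataframe rather than editing the original.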


As `.dropna()` returns a version of the dataframe without the offending rows, if we want to continue working with the cleaned version, we simply overwrite the variable with the new dataframe.

This is shown below as a comment as we don't want to actually do that just yet!

In [28]:
# songs_df = songs_df.dropna()

If we wanted to keep the row, we could instead replace the missing value with another value, such as the average for that column. There are other ways to generate replacement data, but every replacement is an estimate rather than a real observation. In general it is better to drop these rows unless you absolutely have to keep them.

Again, to save the transformed result we overwrite, but this time we overwrite the specific column in the dataframe. This is shown below as a comment to avoid committing changes.

In [29]:
# avg_speechiness = songs_df['speechiness'].mean()
# songs_df['speechiness'] = songs_df['speechiness'].fillna(avg_speechiness)
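Since the cell above is commented out, here is a small runnable sketch of the same mean-filling pattern on a made-up column. The values in `toy_df` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy column with one missing value
toy_df = pd.DataFrame({"speechiness": [0.1, None, 0.3]})

# .mean() skips missing values by default, so this averages 0.1 and 0.3
avg_speechiness = toy_df["speechiness"].mean()

# .fillna() returns a new column, so we assign it back to keep the change
toy_df["speechiness"] = toy_df["speechiness"].fillna(avg_speechiness)
print(toy_df)
```

Only the missing cell is changed; the values that were already present are left as they were.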

### When missing data doesn't look missing
Sometimes datasets can fool you into thinking they're more complete than they are. According to `.info()` there are no missing values in the `genre` column. However if we look at the data we can see an odd value called `***OOPS!***`. This looks like a placeholder value entered if data collection went wrong. 

We can replace this with a `NaN` or `NA`, an object that represents a missing value - when we used `.isna`, `.dropna` and `.fillna` Pandas was specifically looking for these `NA` objects.

First let's check how many of these odd placeholder values we have.

In [30]:
songs_df[songs_df['genre'] == '***OOPS!***'].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 3 to 1210
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       50 non-null     object 
 1   track_name     50 non-null     object 
 2   artists        50 non-null     object 
 3   genre          50 non-null     object 
 4   release_year   50 non-null     int64  
 5   release_date   50 non-null     object 
 6   explicit       50 non-null     bool   
 7   popularity     50 non-null     int64  
 8   duration_ms    50 non-null     int64  
 9   playlist_name  50 non-null     object 
 10  danceability   50 non-null     float64
 11  loudness       50 non-null     float64
 12  speechiness    50 non-null     float64
 13  playlist_type  50 non-null     object 
dtypes: bool(1), float64(3), int64(3), object(7)
memory usage: 5.5+ KB
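If all we need is the count, a shorter route than filtering and calling `.info()` is to sum the filter itself. This is a sketch on a made-up `genre` column (`toy_df` is hypothetical); `.value_counts()` is shown as a second way to spot a suspicious value.

```python
import pandas as pd

# Hypothetical toy data containing the placeholder value
toy_df = pd.DataFrame(
    {"genre": ["dance pop", "***OOPS!***", "alt z", "***OOPS!***"]}
)

# The comparison gives a True/False column; summing counts the True values
is_placeholder = toy_df["genre"] == "***OOPS!***"
n_placeholder = is_placeholder.sum()
print(n_placeholder)

# value_counts() tallies every distinct value, most frequent first
print(toy_df["genre"].value_counts())
```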


We can `.replace()` all these values with `pd.NA`, Pandas' missing value object, so that we have a clearer picture of our data and have the option to use our other missing data cleaning methods.

In [31]:
songs_df['genre'] = songs_df['genre'].replace('***OOPS!***', pd.NA)
songs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1280 entries, 0 to 1279
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       1280 non-null   object 
 1   track_name     1280 non-null   object 
 2   artists        1280 non-null   object 
 3   genre          1230 non-null   object 
 4   release_year   1280 non-null   int64  
 5   release_date   1280 non-null   object 
 6   explicit       1280 non-null   bool   
 7   popularity     1280 non-null   int64  
 8   duration_ms    1280 non-null   int64  
 9   playlist_name  1280 non-null   object 
 10  danceability   1280 non-null   float64
 11  loudness       1280 non-null   float64
 12  speechiness    1279 non-null   float64
 13  playlist_type  1280 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(7)
memory usage: 131.4+ KB
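If we already know the placeholder before loading the file, an alternative to replacing it afterwards is to tell `pd.read_csv` to treat it as missing, using the `na_values` argument. The sketch below reads a tiny made-up CSV from a string (via `io.StringIO`) so that it runs on its own; with a real file you would pass the filename instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file
csv_text = (
    "track_name,genre\n"
    "song_a,dance pop\n"
    "song_b,***OOPS!***\n"
    "song_c,alt z\n"
)

# na_values lists extra strings that should be read in as missing values
toy_df = pd.read_csv(io.StringIO(csv_text), na_values=["***OOPS!***"])

n_missing = toy_df["genre"].isna().sum()
print(n_missing)
```

Either route ends in the same place: the placeholder becomes a missing value that `.isna()`, `.dropna()` and `.fillna()` can see.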


Now we have a more accurate representation of our missing values, let's go ahead and just drop any row with missing data using `.dropna()`.

In [32]:
songs_df = songs_df.dropna()
songs_df.head()

Unnamed: 0,track_id,track_name,artists,genre,release_year,release_date,explicit,popularity,duration_ms,playlist_name,danceability,loudness,speechiness,playlist_type
1,56y1jOTK0XSvJzVv9vHQBK,Paint The Town Red,Doja Cat,dance pop,2023,2023-09-20,True,87,230480,Top 50 - United Kingdom,0.864,-7.683,0.194,mixed_pop
2,1reEeZH9wNt4z1ePYLyC7p,greedy,Tate McRae,alt z,2023,2023-09-13,True,31,131872,Top 50 - United Kingdom,0.75,-3.19,0.0322,mixed_pop
5,2FDTHlrBguDzQkp7PVj16Q,Sprinter,Dave,uk hip hop,2023,2023-06-01,True,94,229133,Top 50 - United Kingdom,0.916,-8.067,0.241,mixed_pop
6,1BxfuPKGuaTgP7aM0Bbdwr,Cruel Summer,Taylor Swift,pop,2019,2019-08-23,False,99,178426,Top 50 - United Kingdom,0.552,-5.707,0.157,mixed_pop
7,3vkCueOmm7xQDoJ17W1Pm3,My Love Mine All Mine,Mitski,brooklyn indie,2023,2023-09-15,False,93,137773,Top 50 - United Kingdom,0.504,-14.958,0.0321,mixed_pop


Dropping rows leaves gaps in the index, as we can see above where it jumps from 2 to 5. Unless you want to retain the index to match back to the original data later, it is often a good idea to `.reset_index()` before continuing. We use `drop=True` so that the old index is discarded entirely rather than kept as a new column.

In [35]:
songs_df = songs_df.reset_index(drop=True)
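To see what `drop=True` changes, here is a minimal sketch on a made-up dataframe (`toy_df` is hypothetical) comparing `.reset_index()` with and without it.

```python
import pandas as pd

# Hypothetical toy data with one missing value in the middle row
toy_df = pd.DataFrame({"genre": ["dance pop", None, "alt z"]})

# After dropping the middle row the index has a gap: 0, 2
cleaned = toy_df.dropna()
print(list(cleaned.index))

# Without drop=True the old index is kept as a new column called 'index'
kept_old = cleaned.reset_index()
print(list(kept_old.columns))

# With drop=True the old index is discarded and rows are renumbered from 0
fresh = cleaned.reset_index(drop=True)
print(list(fresh.index))
```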

### Fixing wrong data types
Sometimes, either because of the way Pandas interpreted the data when loading it or because of the way the data was created, a column won't be the right type of data.

In our dataset we have a `release_date` column, and currently it is listed as an `object`.

In [36]:
songs_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1229 entries, 0 to 1228
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   track_id       1229 non-null   object 
 1   track_name     1229 non-null   object 
 2   artists        1229 non-null   object 
 3   genre          1229 non-null   object 
 4   release_year   1229 non-null   int64  
 5   release_date   1229 non-null   object 
 6   explicit       1229 non-null   bool   
 7   popularity     1229 non-null   int64  
 8   duration_ms    1229 non-null   int64  
 9   playlist_name  1229 non-null   object 
 10  danceability   1229 non-null   float64
 11  loudness       1229 non-null   float64
 12  speechiness    1229 non-null   float64
 13  playlist_type  1229 non-null   object 
dtypes: bool(1), float64(3), int64(3), object(7)
memory usage: 126.1+ KB


If we look at the `release_date` value for the first row, we can see it is actually a string. If we ask Pandas to `.describe()` the column it can do very little, as it thinks the values are just words rather than dates.

In [37]:
songs_df.loc[0,'release_date']

'2023-09-20'

In [39]:
songs_df['release_date'].describe()

count           1229
unique           684
top       2023-09-08
freq              15
Name: release_date, dtype: object

We can recast the column as a date by using `pd.to_datetime()` which takes a column of strings and returns a column of dates.

In [41]:
songs_df['release_date'] = pd.to_datetime(songs_df['release_date'])
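# datetime_is_numeric=True (pandas 1.1+) asks for min/mean/max style statistics.
# It was removed in pandas 2.0, where this is the default, so leave it out there.
songs_df['release_date'].describe(datetime_is_numeric=True)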

count                             1229
mean     2001-01-04 20:20:53.702196864
min                1954-01-01 00:00:00
25%                1983-03-23 00:00:00
50%                2008-03-28 00:00:00
75%                2020-05-22 00:00:00
max                2023-10-13 00:00:00
Name: release_date, dtype: object
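Once a column holds real dates we can compare them and pull out parts such as the year with the `.dt` accessor. The sketch below uses a short made-up series and also shows `errors="coerce"`, which turns any string that cannot be parsed into a missing date (`NaT`) instead of raising an error. The strings here are invented for illustration.

```python
import pandas as pd

# Hypothetical strings; the last one is deliberately not a date
date_strings = pd.Series(["2023-09-20", "2019-08-23", "not a date"])

# errors="coerce" converts unparseable strings to NaT (a missing date)
dates = pd.to_datetime(date_strings, errors="coerce")
print(dates)

# Dates can be compared like numbers: later dates are "greater"
print(dates.iloc[0] > dates.iloc[1])

# The .dt accessor exposes parts of each date, such as the year
years = dates.dt.year
print(years)
```

Because `NaT` counts as missing, any strings that failed to parse can then be found with `.isna()` or removed with `.dropna()`, just like the other missing values earlier.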

## Exercises 1
Take a look at section 1 of the exercises sheet. Complete the tasks before moving on.

## Exercises 2
Take a look at section 2 of the exercises sheet. Complete the tasks. 

If there is time, work through the appropriate chapter of the McLevey textbook OR the recommended DataCamp course. 

See Moodle for details.