# Data Transformation I

Next, we will begin transforming our dataset by dropping values. Our primary goals in this step are to:

* drop rows with missing data
* drop select columns with overwhelmingly missing data

Use the documentation linked in each code block. When you are done with this section of the project, validate that your output matches the screenshot provided in the `docs/part2.md` file.

In [2]:
import pandas as pd
import numpy as np

In [3]:
# TODO: load `data/raw/shopping.csv` as a pandas dataframe

df = pd.read_csv('../data/raw/shopping.csv')

In [31]:
# TODO: print out the shape of this dataframe for better clarity
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html

df.shape

(3900, 15)

In [32]:
# TODO: display how many null values are in each column of this dataframe
# Documentation: https://datatofish.com/count-nan-pandas-dataframe/

df.isna().sum()

Customer ID                  0
Age                        390
Gender                       0
Item Purchased               0
Purchase Amount (USD)        0
Location                   390
Size                         0
Color                        0
Season                       0
Review Rating             2469
Shipping Type                0
Promo Code Used              0
Previous Purchases           0
Payment Method               0
Frequency of Purchases    2340
dtype: int64
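
The raw counts above are easier to judge as a share of all rows. The following is a minimal sketch on a small hypothetical frame (`toy`, `missing_share`, and the 0.5 threshold are invented for illustration and are not taken from `shopping.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [21.0, np.nan, 38.0, 45.0],
    "Location": ["Maine", "Nevada", None, "Oregon"],
    "Review Rating": [4.0, np.nan, np.nan, np.nan],
})

# isna() yields a boolean frame; the mean of booleans is the fraction of True values
missing_share = toy.isna().mean()
print(missing_share)

# Columns whose missing share exceeds a chosen threshold (0.5 is an arbitrary example)
mostly_empty = missing_share[missing_share > 0.5].index.tolist()
print(mostly_empty)
```

The same `isna().mean()` call on the real dataframe would give the per-column share directly, which can help when deciding which columns are too sparse to keep.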

In [33]:
# TODO: roughly 60% of the values in "Frequency of Purchases" are missing (2340 of 3900 rows).
# Drop this column, as it is mostly empty and not needed for our analysis.
# Also drop "Customer ID", as this column is unnecessary as well.
# Assign the resulting dataframe to a new variable.
# Documentation: drive.google.com/drive/folders/1pAWY1JqIQw26uhtT272AoDDeq7jtbkm2

new_df = df.drop(columns=["Frequency of Purchases", "Customer ID"])
new_df.head()

Unnamed: 0,Age,Gender,Item Purchased,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Shipping Type,Promo Code Used,Previous Purchases,Payment Method
0,,Male,Jacket,30.904467,Maine,M,Burnt orange,Fall,4.0,Standard,No,0,Credit Card
1,21.0,Female,Backpack,31.588259,,L,Turquoise,Winter,2.0,Express,No,1,Credit Card
2,31.0,Male,Leggings,24.231704,Nevada,M,Terra cotta,Winter,4.0,Standard,No,0,Credit Card
3,,Male,Pajamas,33.918834,Nebraska,M,Black,Winter,,Standard,No,2,Credit Card
4,38.0,Male,Sunglasses,36.545487,Oregon,S,Aubergine,Summer,,Standard,No,0,Credit Card


In [34]:
# TODO: print out the shape of this dataframe and verify that the shape is "(3900, 13)"

new_df.shape

(3900, 13)

In [91]:
# TODO: while "Review Rating" is also mostly empty, we are interested in figuring out why some users
# leave reviews and others don't.

# Therefore we will NOT drop this column. Instead, replace all missing values
# in "Review Rating" with "Missing" and all non-missing values with "Present".
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html

# Label each rating "Missing" when it is NaN and "Present" otherwise, then assign the
# result back to the column so that later cells see the transformed values.
# Note: NaN never compares equal to anything (including itself), so test for it with isna().
new_df["Review Rating"] = np.where(new_df["Review Rating"].isna(), "Missing", "Present")
print(new_df["Review Rating"])

0       Present
1       Present
2       Present
3       Missing
4       Missing
         ...   
3895    Missing
3896    Present
3897    Missing
3898    Missing
3899    Present
Name: Review Rating, Length: 3900, dtype: object


In [None]:
# TODO: Now that we've dropped and transformed our columns, drop the remaining rows that contain missing values
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html

...
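
One possible shape for this step, sketched on a hypothetical toy frame rather than the real data (`toy` and `cleaned` are illustrative names, and the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "Review Rating" is already labelled, so it has no NaN left
toy = pd.DataFrame({
    "Age": [21.0, np.nan, 38.0],
    "Location": ["Maine", "Nevada", None],
    "Review Rating": ["Present", "Missing", "Missing"],
})

# dropna() with default arguments removes every row that has at least one missing
# value and returns a new frame; the original frame is left unchanged
cleaned = toy.dropna()
print(cleaned.shape)
print(cleaned.isna().sum())
```

Because `dropna()` returns a new frame by default, the result has to be assigned to a variable for the later cells to use it.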

In [None]:
# TODO: display how many null values are in each column of this dataframe
# validate that each column has no missing values

...

In [None]:
# TODO: print out the shape of this dataframe and verify that the shape is "(3158, 13)"

...

In [None]:
# TODO: print out the first 5 rows of this dataframe for validation

...

In [None]:
# TODO: write this newly transformed dataset to the `data/processed` folder. Name it "shopping_cleaned.csv" 
# Be sure to not include an additional index when writing this csv file
# Documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html

...
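
A minimal sketch of writing a frame without the extra index column, assuming a hypothetical toy frame and a temporary directory instead of the project's `data/processed` folder (`toy`, `out_path`, and `toy_cleaned.csv` are illustrative names):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy data standing in for the cleaned dataset
toy = pd.DataFrame({"Age": [21.0, 38.0], "Gender": ["Female", "Male"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "toy_cleaned.csv"  # hypothetical file name
    # index=False keeps the row labels out of the file
    toy.to_csv(out_path, index=False)
    # Read the file back to confirm which columns were written
    round_trip = pd.read_csv(out_path)

print(round_trip.columns.tolist())
```

If the index is written (the default), reading the file back with `pd.read_csv` typically shows it as an extra `Unnamed: 0` column, which is why `index=False` matters here.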