# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [1]:
# Import pandas and any other libraries you need here.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Create a new dataframe from your CSV

df = pd.read_csv(r'C:\Users\Jacob\Desktop\Kaggle datasets\Womens.csv')

In [25]:
# Print out any information you need to understand your dataframe
# df.info()
df.sample(5)

,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
22527,22527,895,46,So wanted to love this but...,I was expecting elbow-to-full-length sleeves a...,2,0,2,General,Tops,Fine gauge
9688,9688,865,26,One of my favorite tops...,The material is super soft and the color is sp...,5,1,0,General,Tops,Knits
4241,4241,1068,60,Cargo pants with style and comfort!,Like these pants and they were on sale--does n...,5,1,0,General Petite,Bottoms,Pants
6289,6289,593,23,Gorgeous and socially conscious,I was very hesitant about this. i read the pre...,5,1,0,General,Trend,Trend
13186,13186,1078,43,Easy to wear dress,Purchased the blue motif colorway which (as ot...,5,1,3,General,Dresses,Dresses


## Missing Data

Try out different methods to locate and resolve missing data.

In [18]:
# Try to find some missing data!
# df.isnull()
# df.isna()
df.isna().sum()

Unnamed: 0                    0
Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# I found missing data using df.isnull() and df.isna(), but their full True/False tables take a while to scan through.
# I had much better luck with df.isna().sum(), which counts the null values in each column.
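
The cells above locate missing values but do not resolve them. A minimal sketch of two common ways to resolve them, using a small made-up dataframe rather than the real reviews file (the `reviews` frame and its values are assumptions for illustration; which columns to fill or drop depends on your analysis):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the reviews data; the values are made up for illustration.
reviews = pd.DataFrame({
    "Title": ["Great dress", np.nan, "Too small"],
    "Review Text": ["Love it", "Runs large", np.nan],
    "Division Name": ["General", np.nan, "General Petite"],
})

# Count missing values per column, as with df.isna().sum() above.
missing_counts = reviews.isna().sum()

# Option 1: fill optional text fields with an empty string.
filled = reviews.fillna({"Title": "", "Review Text": ""})

# Option 2: drop rows that lack a value in a column the analysis requires.
cleaned = filled.dropna(subset=["Division Name"])

print(missing_counts)
print(cleaned)
```

Both `fillna` and `dropna` return new dataframes, so the results are assigned to new names here instead of modifying `reviews` in place.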

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [1]:
# Keep an eye out for outliers!
# Run the cells that create df first; in a fresh kernel this cell fails with
# NameError: name 'df' is not defined
df.plot.scatter(x="Age", y="Positive Feedback Count")

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:
# I used a scatter plot (pandas' plot.scatter, which draws with matplotlib) to find possible outliers.
# The visualization makes points that sit far from the rest of the data stand out clearly.
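
A scatter plot finds outliers by eye. As a numeric complement, here is a hedged sketch of the common 1.5 × IQR rule, which is not part of the original exercise. The `counts` series is made up for illustration, and the 1.5 multiplier is a convention, not a requirement.

```python
import pandas as pd

# Made-up feedback counts; 122 is planted as an obvious outlier.
counts = pd.Series([0, 0, 1, 2, 3, 4, 6, 122], name="Positive Feedback Count")

# Interquartile range: the spread of the middle 50% of the values.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1

# Flag anything more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = counts[(counts < lower) | (counts > upper)]

print(outliers)
```

On the real data the same steps could be applied to a column such as `df["Positive Feedback Count"]`, though heavily skewed columns may flag many points this way.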

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [19]:
# Look out for unnecessary data!
# drop() returns a new dataframe, so assign the result back to df to keep the change.
df = df.drop(['Unnamed: 0'], axis=1)
df

,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses
...,...,...,...,...,...,...,...,...,...,...
23481,1104,34,Great dress for many occasions,I was very happy to snag this dress at such a ...,5,1,0,General Petite,Dresses,Dresses
23482,862,48,Wish it was made of cotton,"It reminds me of maternity clothes. soft, stre...",3,1,0,General Petite,Tops,Knits
23483,1104,31,"Cute, but see through","This fit well, but the top was very see throug...",3,0,1,General Petite,Dresses,Dresses
23484,1084,28,"Very cute dress, perfect for summer parties an...",I bought this dress for a wedding i have this ...,3,1,2,General,Dresses,Dresses


Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
# Yes, there was a column called "Unnamed: 0", which is obviously unneeded because it just repeats the row index. I used df.drop(['Unnamed: 0'], axis = 1) to drop it.
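
One detail worth checking when dropping a column: `drop` returns a new dataframe and leaves the original unchanged unless the result is assigned. The sketch below shows this on a tiny in-memory CSV (the `csv_text` contents are made up), along with `index_col=0` in `read_csv` as one possible way to avoid creating the extra column in the first place.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (made-up rows).
# The first column has no header because it is a saved row index.
csv_text = ",Clothing ID,Age\n0,767,33\n1,1080,34\n"

# pandas names the headerless first column "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# drop() returns a new dataframe; raw itself still has the column.
trimmed = raw.drop(columns=["Unnamed: 0"])

# Alternative: tell read_csv to use the first column as the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(trimmed.columns))
print(list(indexed.columns))
```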

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.
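
As a rough sketch of what re-formatting a column can look like, the example below normalizes whitespace and capitalization in a made-up series of labels. The `labels` values are assumptions for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up labels with inconsistent spacing and capitalization.
labels = pd.Series(
    ["Tops", "tops ", " TOPS", "Dresses", "dresses"],
    name="Department Name",
)

# value_counts() exposes the variants: each spelling is counted separately.
raw_counts = labels.value_counts()

# Strip surrounding whitespace and apply one capitalization style.
normalized = labels.str.strip().str.title()
clean_counts = normalized.value_counts()

print(raw_counts)
print(clean_counts)
```

Running `value_counts()` on each text column of the real dataframe is one quick way to spot near-duplicate categories before deciding whether any re-formatting is needed.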

In [None]:
# Look out for inconsistent data
# I could not find any inconsistent data.

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!