# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [52]:
# Import pandas and any other libraries you need here.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a new dataframe from your CSV.
womans_clothing_ec_reviews = pd.read_csv("C:/Users/johar/LaunchCode/data-analysis-projects/Womens Clothing E-Commerce Reviews.csv")
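
The `Unnamed: 0` column in the output below looks like a row index that was saved into the CSV. As a minimal sketch, assuming that is the case, `index_col=0` reads the first column as the index instead of as data. The CSV string and the name `toy_reviews` are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Toy CSV standing in for the real file (a few illustrative rows only).
toy_csv = io.StringIO(
    ",Clothing ID,Age,Rating\n"
    "0,767,33,4\n"
    "1,1080,34,5\n"
    "2,1077,60,3\n"
)

# index_col=0 uses the first column as the row index
# instead of loading it as a data column named "Unnamed: 0".
toy_reviews = pd.read_csv(toy_csv, index_col=0)

print(toy_reviews.shape)   # (rows, columns)
print(toy_reviews.head())  # first rows only, easier to scan than the full frame
print(toy_reviews.dtypes)  # one dtype per column
```

For a first look, `shape`, `head()`, and `dtypes` are often easier to read than printing the whole dataframe.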

In [51]:
# Print out any information you need to understand your dataframe
print(womans_clothing_ec_reviews)

       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                                    NaN   
1                                                    NaN   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c

## Missing Data

Try out different methods to locate and resolve missing data.

In [45]:
# Try to find some missing data!
womans_clothing_ec_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  int64 
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: int64(6), object(5)
memory usage: 2.0+ MB


Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# Yes. I first tried isna(), but its output only shows the first and last few rows, so I couldn't get an accurate null count for
# every column. I then used .info(), which lists the non-null count per column and showed which columns had nulls and how many.
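
One way to get an exact null count per column is to chain `sum()` onto `isna()`. The sketch below uses a small made-up dataframe (`toy`) rather than the real dataset, and also shows two common ways to resolve the gaps. Which one is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (made up for illustration).
toy = pd.DataFrame({
    "Title": [np.nan, "My favorite buy!", "Flattering shirt", np.nan],
    "Rating": [4, 5, 5, 3],
    "Division Name": ["General", None, "General", "General Petite"],
})

# isna() marks each cell True/False; sum() adds up the True values per column.
null_counts = toy.isna().sum()
print(null_counts)

# Option 1: drop rows that are missing a value in a specific column.
dropped = toy.dropna(subset=["Division Name"])

# Option 2: fill missing values in a column with a placeholder.
filled = toy.fillna({"Title": ""})
```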


## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [58]:
# Keep an eye out for outliers!
womans_clothing_ec_reviews.describe()

| | Unnamed: 0 | Clothing ID | Age | Rating | Recommended IND | Positive Feedback Count |
| --- | --- | --- | --- | --- | --- | --- |
| count | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 |
| mean | 11742.5 | 918.118709 | 43.198544 | 4.196032 | 0.822362 | 2.535936 |
| std | 6779.968547 | 203.29898 | 12.279544 | 1.110031 | 0.382216 | 5.702202 |
| min | 0.0 | 0.0 | 18.0 | 1.0 | 0.0 | 0.0 |
| 25% | 5871.25 | 861.0 | 34.0 | 4.0 | 1.0 | 0.0 |
| 50% | 11742.5 | 936.0 | 41.0 | 5.0 | 1.0 | 1.0 |
| 75% | 17613.75 | 1078.0 | 52.0 | 5.0 | 1.0 | 3.0 |
| max | 23485.0 | 1205.0 | 99.0 | 5.0 | 1.0 | 122.0 |


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here: I ran the describe() function to see the quartiles on all the columns with integers. I also ran plot.hist and plot.scatter
# on a few columns to see how the data looks. There are some higher age numbers but I don't feel that anything is an outlier.
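
The quartiles from `describe()` can also be turned into a numeric check. The sketch below applies the common 1.5 × IQR rule to a small made-up series. The helper `iqr_bounds` and the sample values are hypothetical, and the rule is only one convention for flagging candidates. Whether a flagged value is really an outlier remains a judgment call.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values (made up), with one very large count at the end.
toy_counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 122], name="Positive Feedback Count")

lower, upper = iqr_bounds(toy_counts)

# Keep only the values that fall outside the fences.
flagged = toy_counts[(toy_counts < lower) | (toy_counts > upper)]
print(lower, upper)
print(flagged)
```

Applied to the `describe()` output above, the same rule would put the upper fence for `Positive Feedback Count` at 3 + 1.5 × 3 = 7.5, well below the maximum of 122, so that column may be worth a closer look.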

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [74]:
# Look out for unnecessary data!
womans_clothing_ec_reviews.info()
# Note: drop() returns a new dataframe; assign the result to a variable to keep the change.
womans_clothing_ec_reviews.drop(columns=['Review Text', 'Title', 'Positive Feedback Count'])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 11 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Unnamed: 0               23486 non-null  int64 
 1   Clothing ID              23486 non-null  int64 
 2   Age                      23486 non-null  int64 
 3   Title                    19676 non-null  object
 4   Review Text              22641 non-null  object
 5   Rating                   23486 non-null  int64 
 6   Recommended IND          23486 non-null  bool  
 7   Positive Feedback Count  23486 non-null  int64 
 8   Division Name            23472 non-null  object
 9   Department Name          23472 non-null  object
 10  Class Name               23472 non-null  object
dtypes: bool(1), int64(5), object(5)
memory usage: 1.8+ MB


| | Unnamed: 0 | Clothing ID | Age | Rating | Recommended IND | Division Name | Department Name | Class Name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 767 | 33 | 4 | True | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | 5 | True | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | 3 | False | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | 5 | True | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | 5 | True | General | Tops | Blouses |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 23481 | 23481 | 1104 | 34 | 5 | True | General Petite | Dresses | Dresses |
| 23482 | 23482 | 862 | 48 | 3 | True | General Petite | Tops | Knits |
| 23483 | 23483 | 1104 | 31 | 3 | False | General Petite | Dresses | Dresses |
| 23484 | 23484 | 1084 | 28 | 3 | True | General | Dresses | Dresses |


Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here. Yes, I think that positive feedback count, review text, and title are all unnecessary data. I deleted those columns using .drop().
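
One detail worth checking when removing columns: `drop()` returns a new dataframe and leaves the original unchanged unless the result is assigned. A minimal sketch on a made-up dataframe (`toy`, `trimmed`, and `deduped` are illustrative names), which also removes duplicate rows with `drop_duplicates()`:

```python
import pandas as pd

# Toy data (made up); the last two rows are identical.
toy = pd.DataFrame({
    "Clothing ID": [767, 1080, 1080],
    "Title": [None, "My favorite buy!", "My favorite buy!"],
    "Rating": [4, 5, 5],
})

# Calling drop() without assigning the result does not change `toy`.
toy.drop(columns=["Title"])
print("Title" in toy.columns)  # the column is still there

# Assign the result to keep the change.
trimmed = toy.drop(columns=["Title"])

# Remove rows that are exact duplicates of an earlier row.
deduped = trimmed.drop_duplicates()
print(deduped)
```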

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [72]:
# Look out for inconsistent data!
womans_clothing_ec_reviews['Recommended IND'] = womans_clothing_ec_reviews['Recommended IND'].astype('bool')
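
Beyond dtype conversions, text columns are a common source of inconsistent formatting. The sketch below, on made-up data, trims whitespace, normalizes capitalization, and replaces a suspected misspelling. Treating `Initmates` as a typo for `Intimates` is an assumption here, not something the dataset confirms.

```python
import pandas as pd

# Toy data (made up) with stray whitespace, mixed case, and a suspected typo.
toy = pd.DataFrame({
    "Division Name": ["General", " general", "Initmates", "General Petite"],
    "Recommended IND": [1, 1, 0, 1],
})

# Trim surrounding whitespace and use title case for every value.
cleaned = toy["Division Name"].str.strip().str.title()

# Assumption: "Initmates" is a misspelling of "Intimates".
cleaned = cleaned.replace({"Initmates": "Intimates"})
toy["Division Name"] = cleaned

# Convert the 0/1 indicator column to a Boolean dtype.
toy["Recommended IND"] = toy["Recommended IND"].astype(bool)

# value_counts() gives a quick view of the distinct values that remain.
print(toy["Division Name"].value_counts())
```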

Did you find any inconsistent data? What did you do to clean it?

In [73]:
# Make your notes here! If you chose to keep Recommended IND then that should be converted to a Boolean. I converted that column to Boolean from int64.

       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                                    NaN   
1                                                    NaN   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c