# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [3]:
# Import pandas and any other libraries you need here.
import pandas as pd
# Create a new dataframe from your CSV

df = pd.read_csv(r"C:\LC\Launchcode 2024\Python\Python_Assignments\cleaning-data-with-pandas\studio\Womens-Clothing-E-Commerce-Reviews.csv", index_col = 0)
df.head(5)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


In [12]:
# Print out any information you need to understand your dataframe
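# Note: df.info() prints its summary itself and returns None, so display() also shows "None" after it.
display(df.info())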

display(df.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 23486 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              23486 non-null  int64 
 1   Age                      23486 non-null  int64 
 2   Title                    19676 non-null  object
 3   Review Text              22641 non-null  object
 4   Rating                   23486 non-null  int64 
 5   Recommended IND          23486 non-null  int64 
 6   Positive Feedback Count  23486 non-null  int64 
 7   Division Name            23472 non-null  object
 8   Department Name          23472 non-null  object
 9   Class Name               23472 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.0+ MB


None

Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,23486.0,23486.0,23486.0,23486.0,23486.0
mean,918.118709,43.198544,4.196032,0.822362,2.535936
std,203.29898,12.279544,1.110031,0.382216,5.702202
min,0.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,99.0,5.0,1.0,122.0


## Missing Data

Try out different methods to locate and resolve missing data.

In [13]:
# Try to find some missing data!
display(df.isnull().sum())
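# Placeholder for missing free-text fields (this is the string "None", not a null value).
cols = {col: "None" for col in ["Title", "Review Text"]}
# Placeholder for missing category labels.
cols.update({col: "Unknown" for col in ["Division Name", "Department Name", "Class Name"]})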
df.fillna(value=cols, inplace=True)
display(df.isnull().sum())

Clothing ID                   0
Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# isnull() and isna() gave the same counts.
# It was useful to call .sum() to show how many values are missing in each column.
# I then passed a dictionary to fillna() to give the missing entries placeholder values.
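
The notes above describe counting missing values and filling them from a dictionary. As a minimal sketch of two other ways to look at and resolve missing data, the example below uses a small made-up frame (`toy` and its values are hypothetical, not taken from the reviews dataset). It reports the share of missing values per column and compares filling with dropping rows.

```python
import pandas as pd

# Hypothetical toy frame standing in for the reviews data.
toy = pd.DataFrame({
    "Title": ["Great", None, "Runs small", None],
    "Rating": [5, 4, 3, 5],
    "Division Name": ["General", "General", None, "General Petite"],
})

# Count and fraction of missing values in each column.
missing_counts = toy.isna().sum()
missing_share = toy.isna().mean()
print(missing_counts)
print(missing_share)

# Option A: fill each column with its own placeholder (returns a new frame).
filled = toy.fillna({"Title": "No title", "Division Name": "Unknown"})

# Option B: drop only the rows that lack a value the analysis depends on.
dropped = toy.dropna(subset=["Division Name"])
```

Filling keeps every row but mixes placeholder strings into the column, while dropping keeps the column clean at the cost of losing rows. Which one fits depends on the analysis.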

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [14]:
# Keep an eye out for outliers!
df.describe()
# An age of 99 and a Clothing ID of 0 both look odd.
display(df.loc[df["Age"] == 99].shape[0])  # Number of rows with an age of 99.
display(df.loc[df["Clothing ID"] == 0].shape[0])  # Number of rows with a Clothing ID of 0.
display(df.loc[df["Age"] == 99])  # The rows with an age of 99.
display(df.loc[df["Clothing ID"] == 0])  # The row with a Clothing ID of 0.

# The approach from the book did not work for me, but this boolean mask did.
outliers = df.loc[(df["Age"] == 99) | (df["Clothing ID"] == 0)]
df.drop(outliers.index, inplace=True)
# Confirm the flagged rows are gone (both results should be empty).
display(df.loc[df["Age"] == 99])
display(df.loc[df["Clothing ID"] == 0])

2

1

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
8327,1078,99,Beachy and boho!,I love the weight of the material; sometimes c...,5,1,1,General,Dresses,Dresses
11545,949,99,,"Great quality, i didn't expect the neck to be ...",4,1,4,General,Tops,Sweaters


Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
14746,0,26,,,5,1,0,General,Jackets,Outerwear


Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name


Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here: I like describe(), but it was also useful to look at the specific rows. I then dropped the rows containing those irregular data points.
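
The outliers above were picked by eye from `describe()`. A common rule-based alternative, which is not the method used in this notebook, is the 1.5 × IQR fence. The sketch below applies it to a short made-up series of ages (the values are hypothetical).

```python
import pandas as pd

# Hypothetical ages; the real "Age" column is much longer.
ages = pd.Series([18, 25, 34, 41, 41, 52, 60, 99], name="Age")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as candidate outliers.
flagged = ages[(ages < lower) | (ages > upper)]
print(flagged)
```

For comparison, the first `describe()` output reports Age quartiles of 34 and 52, so the same rule on the real column would put the upper fence at 52 + 1.5 × 18 = 79. That is a stricter cutoff than removing only the value 99, so the fence is best treated as a way to find rows to inspect. It should not drop rows automatically.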

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [None]:
# Look out for unnecessary data!
# Without a business problem it is hard to tell what is unnecessary, but I will assume it is the Department, Division, and Class columns.
# Drop all three columns by name rather than by position, so the result does not depend on column order.
df.drop(columns=["Division Name", "Department Name", "Class Name"], inplace=True)
display(df.head(5))

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
# I decided the Department, Division, and Class columns were unnecessary and dropped them.
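
Besides judging relevance, some unnecessary data can be detected mechanically. The sketch below is an illustration on a made-up frame (`toy`, `Rating Copy`, and `Batch` are hypothetical names). It looks for fully duplicated rows, for columns with a single constant value, and for columns that repeat another column.

```python
import pandas as pd

# Hypothetical frame with one repeated row, one copied column, and one constant column.
toy = pd.DataFrame({
    "Clothing ID": [767, 1080, 1080, 1077],
    "Rating": [4, 5, 5, 3],
    "Rating Copy": [4, 5, 5, 3],
    "Batch": [1, 1, 1, 1],
})

# Rows that exactly repeat an earlier row.
dup_rows = toy.duplicated().sum()

# Columns with only one distinct value carry no information.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]

# Columns whose values repeat an earlier column (compare columns by transposing).
dup_cols = toy.columns[toy.T.duplicated()].tolist()

# Drop the flagged columns by name, then remove repeated rows.
cleaned = toy.drop(columns=constant_cols + dup_cols).drop_duplicates()
print(cleaned)
```

Transposing is fine for a small frame but can be slow on a wide or very long one, so this check is only a starting point.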

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [15]:
# Look out for inconsistent data!
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
Index: 23483 entries, 0 to 23485
Data columns (total 10 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              23483 non-null  int64 
 1   Age                      23483 non-null  int64 
 2   Title                    23483 non-null  object
 3   Review Text              23483 non-null  object
 4   Rating                   23483 non-null  int64 
 5   Recommended IND          23483 non-null  int64 
 6   Positive Feedback Count  23483 non-null  int64 
 7   Division Name            23483 non-null  object
 8   Department Name          23483 non-null  object
 9   Class Name               23483 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.0+ MB


Unnamed: 0,Clothing ID,Age,Rating,Recommended IND,Positive Feedback Count
count,23483.0,23483.0,23483.0,23483.0,23483.0
mean,918.149683,43.194524,4.195972,0.82234,2.536047
std,203.220885,12.269011,1.110076,0.382235,5.702525
min,1.0,18.0,1.0,0.0,0.0
25%,861.0,34.0,4.0,1.0,0.0
50%,936.0,41.0,5.0,1.0,1.0
75%,1078.0,52.0,5.0,1.0,3.0
max,1205.0,94.0,5.0,1.0,122.0


Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# No inconsistent data.
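
`info()` and `describe()` summarize only data types and numeric columns, so a quick way to double-check text columns is to list their distinct labels. The sketch below runs on a made-up series (the values are hypothetical, apart from the label "Initmates", which appears in the `Division Name` column of the `head()` output at the top). Treating that label as a misspelling of "Intimates" is an assumption.

```python
import pandas as pd

# Hypothetical labels with stray whitespace, mixed case, and a possible misspelling.
divisions = pd.Series(
    ["General", "general ", "General Petite", "Initmates", "Initmates", " General"],
    name="Division Name",
)

# Listing each distinct label with its frequency makes near-duplicates visible.
print(divisions.value_counts(dropna=False))

# Normalize whitespace and capitalization.
normalized = divisions.str.strip().str.title()

# Assumption: "Initmates" is a typo for "Intimates".
normalized = normalized.replace({"Initmates": "Intimates"})
print(normalized.value_counts())
```

On the real dataframe, the same check would be `df["Division Name"].value_counts(dropna=False)`, repeated for each text column of interest.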
