# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

**Dataset Information:**
- **Dataset Name:** Women's Clothing E-Commerce Reviews
- **File:** `Womens Clothing E-Commerce Reviews.csv`
- **Description:** This dataset contains reviews written by customers and includes features like ratings, review text, product categories, and customer information.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.

In [31]:
# Import pandas and any other libraries you need here.
import pandas as pd
# Create a new dataframe from your CSV
df = pd.read_csv('Womens Clothing E-Commerce Reviews.csv')

In [36]:
# Print out any information you need to understand your dataframe
print(df)
df.info()
df.head(5)


       Unnamed: 0  Clothing ID  Age  \
0               0          767   33   
1               1         1080   34   
2               2         1077   60   
3               3         1049   50   
4               4          847   47   
...           ...          ...  ...   
23481       23481         1104   34   
23482       23482          862   48   
23483       23483         1104   31   
23484       23484         1084   28   
23485       23485         1104   52   

                                                   Title  \
0                                                    NaN   
1                                                    NaN   
2                                Some major design flaws   
3                                       My favorite buy!   
4                                       Flattering shirt   
...                                                  ...   
23481                     Great dress for many occasions   
23482                         Wish it was made of c

|  | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


## Missing Data

Try out different methods to locate and resolve missing data.

In [71]:
# Try to find some missing data!
missing_counts = df.isnull().sum()
print(missing_counts)



Clothing ID                0
Age                        0
Title                      0
Review Text                0
Rating                     0
Recommended IND            0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
dtype: int64


Did you find any missing data? What things worked well for you and what did not?

In [24]:
# Respond to the above questions here:

# Did you find any missing data?
# - Yes. In the raw data, missing values were found in several columns:
#   'Title' (3810 missing), 'Review Text' (845 missing), and the
#   categorical columns 'Division Name', 'Department Name', and 'Class Name' (14 missing each).
# - The zero counts in the output above come from a re-run after the data had been
#   cleaned (note that 'Unnamed: 0' is already gone from that listing).
#
# What things worked well?
# - Using df.isnull().sum() was effective in quickly identifying
#   which columns contain missing values and how many.
# - It gave a clear summary to guide cleaning decisions.
#
# What did not work well?
# - isnull() only locates missing data; it does not say how to handle it,
#   so the choice between filling and dropping needs context.
# - Large missing counts in columns like 'Title' and 'Review Text' may
#   require domain knowledge to decide on imputing or removing rows.
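The fill-or-drop decision mentioned in the notes can be sketched on a tiny example. This is only an illustration under stated assumptions: the `sample` dataframe and its values are made up, and the choice to fill text columns with an empty string while dropping rows that lack a division is one reasonable option, not necessarily what was done in this notebook.

```python
import pandas as pd

# Hypothetical miniature of the reviews data; the values are invented for illustration.
sample = pd.DataFrame({
    "Title": ["Great dress", None, "Too small", None],
    "Review Text": ["Love it", "Runs large", None, "Nice fabric"],
    "Division Name": ["General", "General Petite", None, "General"],
})

# Step 1: count missing values per column before changing anything.
missing_before = sample.isnull().sum()

# Step 2 (assumption): a review without a title or text is still usable,
# so fill those text columns with an empty string.
cleaned = sample.copy()
cleaned[["Title", "Review Text"]] = cleaned[["Title", "Review Text"]].fillna("")

# Step 3 (assumption): a row without a division cannot be categorized, so drop it.
cleaned = cleaned.dropna(subset=["Division Name"])

# Step 4: confirm that nothing is missing any more.
missing_after = cleaned.isnull().sum()
print(missing_before)
print(missing_after)
```

Counting before and after makes it easy to see what each step changed. Working on a copy keeps the original dataframe available in case the filling strategy needs to be revisited.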

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.

In [56]:
# Keep an eye out for outliers!

Q1 = df['Age'].quantile(0.25)
Q3 = df['Age'].quantile(0.75)
IQR = Q3 - Q1
outliers_age = df[(df['Age'] < (Q1 - 1.5 * IQR)) | (df['Age'] > (Q3 + 1.5 * IQR))]
print("Outliers in Age column:\n", outliers_age[['Age']])

df.describe()

Outliers in Age column:
        Age
95      83
234     83
277     83
628     80
659     93
...    ...
22640   80
22716   87
22773   83
23001   83
23033   86

[109 rows x 1 columns]


|  | Clothing ID | Age | Rating | Recommended IND | Positive Feedback Count |
|---|---|---|---|---|---|
| count | 23486.0 | 23486.0 | 23486.0 | 23486.0 | 23486.0 |
| mean | 918.118709 | 43.198544 | 4.196032 | 0.822362 | 2.535936 |
| std | 203.29898 | 12.279544 | 1.110031 | 0.382216 | 5.702202 |
| min | 0.0 | 18.0 | 1.0 | 0.0 | 0.0 |
| 25% | 861.0 | 34.0 | 4.0 | 1.0 | 0.0 |
| 50% | 936.0 | 41.0 | 5.0 | 1.0 | 1.0 |
| 75% | 1078.0 | 52.0 | 5.0 | 1.0 | 3.0 |
| max | 1205.0 | 99.0 | 5.0 | 1.0 | 122.0 |


What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [57]:
# Make your notes here:

# Techniques that helped:
# 1. Using the Interquartile Range (IQR) method:
#    - Calculating Q1 (25th percentile) and Q3 (75th percentile).
#    - Computing IQR = Q3 - Q1.
#    - Defining outliers as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
#    - Filtering the DataFrame to isolate these outlier points.
#
# 2. The describe() method was useful to get quick summary statistics
#    (mean, median, quartiles, min, max) that guided outlier detection.
#
# Why these techniques were effective:
# - The IQR method is built on quartiles, which a few extreme values barely move,
#   so the cutoffs stay stable even when outliers are present.
# - It adapts to the spread of the data without assuming a normal distribution.
# - It is easy to implement and interpret.
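The IQR steps listed in the notes can be wrapped in a small reusable helper. This is a minimal sketch rather than the notebook's own code: `iqr_fences` and `iqr_outliers` are hypothetical names, and the `ages` values are invented. Only the quartiles 34 and 52 are taken from the `df.describe()` output for `Age`.

```python
import pandas as pd

def iqr_fences(q1, q3, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(values, k=1.5):
    """Return the entries of a numeric Series that fall outside the IQR fences."""
    low, high = iqr_fences(values.quantile(0.25), values.quantile(0.75), k)
    return values[(values < low) | (values > high)]

# Fences implied by the Age quartiles reported by df.describe() (25% = 34, 75% = 52).
low, high = iqr_fences(34.0, 52.0)
print(low, high)  # 7.0 79.0

# Invented ages for illustration only.
ages = pd.Series([30, 34, 38, 41, 45, 52, 56, 93], name="Age")
flagged = iqr_outliers(ages)
print(flagged.tolist())  # [93]
```

With fences at 7 and 79, and a minimum age of 18 in the data, only the upper fence matters here, which matches the flagged ages of 80 and above in the outlier listing.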

## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.

In [68]:
# Look out for unnecessary data!
df = df.drop(columns=['Unnamed: 0'], errors='ignore')


Did you find any unnecessary data in your dataset? How did you handle it?

In [28]:
# Make your notes here.
# Yes. The column 'Unnamed: 0' is simply a duplicate index column created during export.
# Since the DataFrame already has an index, this column does not add any value,
# so I removed it with df.drop(columns=['Unnamed: 0'], errors='ignore').
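Before dropping a column like this, it can help to confirm that it really duplicates the row index. The sketch below uses a made-up three-row CSV (`csv_text` is hypothetical, not the real file) and also shows an alternative, assuming the first column of the file is the exported index: passing `index_col=0` to `pd.read_csv` so the extra column never appears.

```python
import io
import pandas as pd

# Hypothetical three-row CSV with the same kind of unnamed leading index column.
csv_text = """,Clothing ID,Age
0,767,33
1,1080,34
2,1077,60
"""

# Default read: the unnamed first column shows up as a regular column 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))

# Check that the column is an exact copy of the row index before dropping it.
is_index_copy = (raw["Unnamed: 0"].to_numpy() == raw.index.to_numpy()).all()
print(is_index_copy)

# Alternative: treat the first column as the index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(raw.columns))
print(list(indexed.columns))
```

Dropping after the fact and using `index_col=0` give the same columns. The second approach just avoids carrying the redundant column through the rest of the notebook.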

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.

In [74]:
# Look out for inconsistent data!

df['Division Name'] = df['Division Name'].replace({'Initmates': 'Intimates'})

df['Title'] = df['Title'].str.lower().str.strip()
df['Review Text'] = df['Review Text'].str.lower().str.strip()

print(df)



       Clothing ID  Age                                              Title  \
0              767   33                                                      
1             1080   34                                                      
2             1077   60                            some major design flaws   
3             1049   50                                   my favorite buy!   
4              847   47                                   flattering shirt   
...            ...  ...                                                ...   
23481         1104   34                     great dress for many occasions   
23482          862   48                         wish it was made of cotton   
23483         1104   31                              cute, but see through   
23484         1084   28  very cute dress, perfect for summer parties an...   
23485         1104   52                    please make more like this one!   

                                             Review Text  Ratin

Did you find any inconsistent data? What did you do to clean it?

In [75]:
# Yes. I found a misspelled category value ("Initmates") in the Division Name column, and I also found inconsistent
# capitalization and extra spaces in the Title and Review Text fields. I cleaned this by correcting the spelling,
# converting the text to lowercase, and stripping leading and trailing whitespace.
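One way to spot inconsistent category values such as "Initmates" is to list the distinct values with their counts and then normalize them. The following is a minimal sketch on invented data: the `sample` dataframe is hypothetical, and the trailing space in `"General "` is added only to illustrate whitespace problems.

```python
import pandas as pd

# Hypothetical values; the 'Initmates' typo mirrors the one found in Division Name.
sample = pd.DataFrame({
    "Division Name": ["General", "Initmates", "General Petite", "General ", "Initmates"],
    "Title": ["  My Favorite Buy! ", "Flattering shirt", None, "CUTE", "ok"],
})

# Distinct values with counts make typos and stray whitespace visible.
print(sample["Division Name"].value_counts(dropna=False))

# Strip whitespace first, then fix known typos with an explicit mapping.
sample["Division Name"] = (
    sample["Division Name"].str.strip().replace({"Initmates": "Intimates"})
)

# Lowercase and trim free-text fields; missing values stay missing.
sample["Title"] = sample["Title"].str.lower().str.strip()

print(sorted(sample["Division Name"].unique()))
print(sample["Title"].tolist())
```

An explicit mapping keeps the correction auditable: only values listed in the dictionary are changed, so an unexpected new spelling shows up in `value_counts()` instead of being silently rewritten.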