# Cleaning Data with Pandas Exercises

For the exercises, you will be cleaning data in the Women's Clothing E-Commerce Reviews dataset.

To start cleaning data, we first need to create a dataframe from the CSV and print out any relevant info to make sure our dataframe is ready to go.
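
As a rough sketch of that first look, the snippet below reads a tiny in-memory sample instead of the real CSV. The column names follow the dataset, but the rows and the names `sample_csv` and `sample_df` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Clothing ID,Age,Title,Rating,Department Name\n"
    "767,33,,4,Intimate\n"
    "1080,34,Love it,5,Dresses\n"
    "1077,60,Some major design flaws,3,Dresses\n"
)

sample_df = pd.read_csv(sample_csv)

# Shape, column dtypes, and non-null counts give a quick overview.
print(sample_df.shape)
sample_df.info()
print(sample_df.head())
```

The non-null counts reported by `info()` are often the first hint that a column has missing values.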

In [None]:
# Import pandas and any other libraries you need here.
import pandas as pd

# Create a new dataframe from your CSV
data_file = pd.read_csv('/Users/miafusco/Documents/LaunchCode/data-analysis-projects2/Womens Clothing E-Commerce Reviews.csv')

# Note: this moves every named column into a MultiIndex, leaving only
# 'Unnamed: 0' as a regular column. data_indexed is not used in the cells below.
data_indexed = data_file.set_index(['Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name'])

In [None]:
# Print out any information you need to understand your dataframe
# Use the review title as the row label, then print the dataframe.
data_indexed_title = data_file.set_index('Title')
print(data_indexed_title)

## Missing Data

Try out different methods to locate and resolve missing data.
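
One possible way to locate and resolve missing values is sketched below. The `reviews` dataframe and its values are invented for illustration, and the placeholder strings passed to `fillna` are assumptions, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the values are invented for illustration.
reviews = pd.DataFrame({
    "Title": [np.nan, "My favorite buy!", "Flattering shirt", np.nan],
    "Review Text": ["Comfortable", np.nan, "Fits well", "Runs small"],
    "Rating": [4, 5, 5, 3],
})

# Locate: count the missing values in each column.
missing_per_column = reviews.isna().sum()
print(missing_per_column)

# Resolve, option 1: drop every row with at least one missing value.
dropped = reviews.dropna(how="any")

# Resolve, option 2: keep the rows and fill missing text with a placeholder.
filled = reviews.fillna({"Title": "No title", "Review Text": ""})

# Caveat: dropna() looks only at column values, so a missing label
# in the index survives the call.
by_title = reviews.set_index("Title").dropna(how="any")
print(by_title.index.isna().sum())
```

Dropping rows is the simplest fix but discards data, while filling keeps every row at the cost of introducing placeholder values.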

In [None]:
# Try to find some missing data!
# Count missing values per column. (Calling isna() on a Series of column
# names only checks the name strings themselves, which are never missing.)
print(data_indexed_title.isna().sum())
# Boolean mask: True wherever a value is missing.
missing_values = data_indexed_title.isna()
print(missing_values)
# Title                                                                               ...                                            
# NaN                                                      False        False  False  ...          False            False       False
# NaN                                                      False        False  False  ...          False            False       False
# Some major design flaws                                  False        False  False  ...          False            False       False
# My favorite buy!                                         False        False  False  ...          False            False       False
# Flattering shirt                                         False        False  False  ...          False            False       False
# ...                                                        ...          ...    ...  ...            ...              ...         ...
# Great dress for many occasions                           False        False  False  ...          False            False       False
# Wish it was made of cotton                               False        False  False  ...          False            False       False
# Cute, but see through                                    False        False  False  ...          False            False       False
# Very cute dress, perfect for summer parties and we       False        False  False  ...          False            False       False
# Please make more like this one!                          False        False  False  ...          False            False       False

# [23486 rows x 10 columns]
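# Drop every row that has at least one missing column value.
# dropna() does not inspect the index, so rows whose Title is NaN are kept.
data_indexed_title = data_indexed_title.dropna(how="any")
print(data_indexed_title)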
# Title                                                                             ...                                             
# NaN                                                          0          767   33  ...       Initmates         Intimate   Intimates
# NaN                                                          1         1080   34  ...         General          Dresses     Dresses
# Some major design flaws                                      2         1077   60  ...         General          Dresses     Dresses
# My favorite buy!                                             3         1049   50  ...  General Petite          Bottoms       Pants
# Flattering shirt                                             4          847   47  ...         General             Tops     Blouses
# ...                                                        ...          ...  ...  ...             ...              ...         ...
# Great dress for many occasions                           23481         1104   34  ...  General Petite          Dresses     Dresses
# Wish it was made of cotton                               23482          862   48  ...  General Petite             Tops       Knits
# Cute, but see through                                    23483         1104   31  ...  General Petite          Dresses     Dresses
# Very cute dress, perfect for summer parties and we       23484         1084   28  ...         General          Dresses     Dresses
# Please make more like this one!                          23485         1104   52  ...  General Petite          Dresses     Dresses

# [22628 rows x 10 columns]

Did you find any missing data? What things worked well for you and what did not?

In [None]:
# Respond to the above questions here:
# Yes.

## Irregular Data

With missing data out of the way, turn your attention to any outliers. Just as we did for missing data, we first need to detect the outliers.
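
A common way to detect outliers is the interquartile range (IQR) rule, sketched here on an invented series. The values in `counts` are hypothetical, and the 1.5 multiplier is a convention, not a requirement of the exercise.

```python
import pandas as pd

# Hypothetical feedback counts; the values are invented for illustration.
counts = pd.Series([0, 0, 1, 1, 2, 3, 3, 4, 122], name="Positive Feedback Count")

# describe() shows the spread; a max far above the 75% quartile hints at outliers.
print(counts.describe())

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = counts.quantile(0.25)
q3 = counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = counts[(counts < lower) | (counts > upper)]
print(outliers)
```

Whether a flagged value is an error or simply a rare but valid observation still has to be judged against what the column means.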

In [None]:
# Keep an eye out for outliers!
# Store the summary in its own variable so the dataframe is not overwritten.
summary = data_indexed_title.describe()
print(summary)
#          Unnamed: 0   Clothing ID           Age        Rating  Recommended IND  Positive Feedback Count
# count  22628.000000  22628.000000  22628.000000  22628.000000     22628.000000             22628.000000
# mean   11737.272097    919.695908     43.282880      4.183092         0.818764                 2.631784
# std     6781.574232    201.683804     12.328176      1.115911         0.385222                 5.787520
# min        0.000000      1.000000     18.000000      1.000000         0.000000                 0.000000
# 25%     5868.750000    861.000000     34.000000      4.000000         1.000000                 0.000000
# 50%    11727.500000    936.000000     41.000000      5.000000         1.000000                 1.000000
# 75%    17617.250000   1078.000000     52.000000      5.000000         1.000000                 3.000000
# max    23485.000000   1205.000000     99.000000      5.000000         1.000000               122.000000

# A rating cannot be below 0 and above 5 at the same time, so combine the
# conditions with | (or) and filter the dataframe itself, not the summary.
outlier = data_indexed_title[(data_indexed_title['Rating'] < 0.0) | (data_indexed_title['Rating'] > 5.0)]
print(outlier)

What techniques helped you find outliers? In your opinion, what about the techniques you used made them effective?

In [None]:
# Make your notes here:


## Unnecessary Data

Unnecessary data could be irrelevant to your analysis or a duplicate column. Check out the dataset to see if there is any unnecessary data.
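
As one illustration, the sketch below checks whether a column merely repeats the row index and drops it if so. The `reviews` sample is invented, and treating `Unnamed: 0` as a saved copy of the row numbers is an assumption you should verify against the real data.

```python
import pandas as pd

# Hypothetical sample; "Unnamed: 0" mimics a saved row index.
reviews = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Clothing ID": [767, 1080, 1077],
    "Rating": [4, 5, 3],
})

# Check whether the column simply repeats the row index.
is_index_copy = (reviews["Unnamed: 0"].to_numpy() == reviews.index.to_numpy()).all()

# Drop the column only if it adds no information.
if is_index_copy:
    reviews = reviews.drop(columns=["Unnamed: 0"])

print(reviews.columns.tolist())
```

Columns that are irrelevant to a particular question can be dropped the same way, ideally on a copy so the original dataframe stays available.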

In [None]:
# Look out for unnecessary data!

Did you find any unnecessary data in your dataset? How did you handle it?

In [None]:
# Make your notes here.
# It depends on what the identified business issue is. If I'm only looking at the department with the highest reviews, I probably don't need age.
# However, if I'm looking to identify each demographic's favorite departments to shop in, I would need age.

## Inconsistent Data

Inconsistent data is likely due to inconsistent formatting and can be addressed by re-formatting all values in a column or row.
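
A minimal sketch of spotting and re-formatting inconsistent labels follows. The `divisions` series is invented: "Initmates" mirrors the spelling visible in the printed output earlier in this notebook, the stray-space and lowercase variants are hypothetical, and "Intimates" is only an assumed intended spelling.

```python
import pandas as pd

# Hypothetical sample of division labels, invented for illustration.
divisions = pd.Series(
    ["General", "Initmates", " general petite", "General Petite", "General"],
    name="Division Name",
)

# value_counts() lists every distinct label, which makes odd spellings easy to spot.
print(divisions.value_counts())

# Normalize whitespace and capitalization, then map a known misspelling
# to one label ("Intimates" is an assumed correction).
cleaned = (
    divisions.str.strip()
    .str.title()
    .replace({"Initmates": "Intimates"})
)
print(cleaned.value_counts())
```

Running `value_counts()` on each text column before and after cleaning is a quick way to confirm that the number of distinct labels went down as expected.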

In [None]:
# Look out for inconsistent data!

Did you find any inconsistent data? What did you do to clean it?

In [None]:
# Make your notes here!
# I don't think there is much inconsistent data. 