<a href="https://www.kaggle.com/code/hassaneskikri/women-s-e-commerce-clothing-reviews?scriptVersionId=168046444" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="font-family: Trebuchet MS; background-color: #f8f9fa; border-left: 5px solid #1b4332; padding: 12px;">
    <h1 style="font-size: 30px; color: #081c15; text-align: center; line-height: 1.25;">Pandas Fundamentals</h1>
    <h2 style="color: #1b4332; font-size: 48px; text-align: center;"><b>Analyzing Women's Clothing E-Commerce using pandas</b></h2>
    <hr style="border-top: 2px solid #264653;">
    <h3 style="font-size: 14px; color: #264653; text-align: right; "><strong>Created By: Hassane Skikri</strong></h3>
</div>




## 1. Project Setup
- [✔️] Install necessary Python libraries (pandas, matplotlib, seaborn, etc.).
- [✔️] Load the Women's Clothing E-Commerce Reviews dataset.

## 2. Data Loading and Inspection
- [✔️] Load the dataset using pandas.
- [✔️] Perform basic data inspection (shape, size, data types, info, describe ...)
- [✔️] View the first few rows to understand the data structure.

## 3. Data Cleaning and Preprocessing
- [✔️] Handle missing values (check, impute, or remove).
- [✔️] Convert data types if necessary (e.g., convert numeric fields to appropriate formats).
- [✔️] Check for and handle duplicate entries.

<p style="background-color: #12f7ff; font-family: 'Trebuchet MS', sans-serif; color: #000; font-size: 150%; text-align: center; border-radius: 50px 15px; padding: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);">
    🔻1️⃣ Importing Libraries 🔻
</p>


In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('/kaggle/input/womens-ecommerce-clothing-reviews/Womens Clothing E-Commerce Reviews.csv')

In [3]:
df.head()

| | Unnamed: 0 | Clothing ID | Age | Title | Review Text | Rating | Recommended IND | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 767 | 33 | NaN | Absolutely wonderful - silky and sexy and comf... | 4 | 1 | 0 | Initmates | Intimate | Intimates |
| 1 | 1 | 1080 | 34 | NaN | Love this dress! it's sooo pretty. i happene... | 5 | 1 | 4 | General | Dresses | Dresses |
| 2 | 2 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 3 | 0 | 0 | General | Dresses | Dresses |
| 3 | 3 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 5 | 1 | 0 | General Petite | Bottoms | Pants |
| 4 | 4 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 5 | 1 | 6 | General | Tops | Blouses |


<p style="background-color: #12f7ff; font-family: 'Trebuchet MS', sans-serif; color: #000; font-size: 150%; text-align: center; border-radius: 50px 15px; padding: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);">
    🔻Data Inspection 🔻
</p>


In [4]:
# Basic data inspection
print(df.shape)

(23486, 11)


***The dataset contains 23,486 rows and 11 columns.***

In [5]:
print(list(df.columns))

['Unnamed: 0', 'Clothing ID', 'Age', 'Title', 'Review Text', 'Rating', 'Recommended IND', 'Positive Feedback Count', 'Division Name', 'Department Name', 'Class Name']


# Feature Descriptions


- **Unnamed: 0:** Appears to be a row index or ID column.
- **Clothing ID:** Identifier for the clothing item.
- **Age:** Age of the reviewer.
- **Title:** Title of the review.
- **Review Text:** Text of the review.
- **Rating:** Rating given by the reviewer.
- **Recommended IND:** Indicator of whether the reviewer recommends the product (1 for yes, 0 for no).
- **Positive Feedback Count:** Number of positive feedback votes the review received.
- **Division Name, Department Name, Class Name:** Categorization of the clothing item.

- **Unnamed: 0:** This looks like a leftover row index, so it carries no information about our target and can be dropped.

- **Clothing ID:** This is only an item identifier and is not used in this analysis, so it is dropped as well.

In [6]:
df = df.drop(columns=['Unnamed: 0','Clothing ID'])

In [7]:
df.columns

Index(['Age', 'Title', 'Review Text', 'Rating', 'Recommended IND',
       'Positive Feedback Count', 'Division Name', 'Department Name',
       'Class Name'],
      dtype='object')

In [8]:
df.describe()

| | Age | Rating | Recommended IND | Positive Feedback Count |
|---|---|---|---|---|
| count | 23486.0 | 23486.0 | 23486.0 | 23486.0 |
| mean | 43.198544 | 4.196032 | 0.822362 | 2.535936 |
| std | 12.279544 | 1.110031 | 0.382216 | 5.702202 |
| min | 18.0 | 1.0 | 0.0 | 0.0 |
| 25% | 34.0 | 4.0 | 1.0 | 0.0 |
| 50% | 41.0 | 5.0 | 1.0 | 1.0 |
| 75% | 52.0 | 5.0 | 1.0 | 3.0 |
| max | 99.0 | 5.0 | 1.0 | 122.0 |


In [9]:
# Summarize column dtypes and non-null counts.
# info() prints its report directly and returns None, so it needs no print().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Age                      23486 non-null  int64 
 1   Title                    19676 non-null  object
 2   Review Text              22641 non-null  object
 3   Rating                   23486 non-null  int64 
 4   Recommended IND          23486 non-null  int64 
 5   Positive Feedback Count  23486 non-null  int64 
 6   Division Name            23472 non-null  object
 7   Department Name          23472 non-null  object
 8   Class Name               23472 non-null  object
dtypes: int64(4), object(5)
memory usage: 1.6+ MB


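#### ***The dataset comprises numerical (int64) columns and object columns. The object columns hold free text (Title, Review Text) and categorical labels (Division Name, Department Name, Class Name).***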

In [10]:
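# Split column names by dtype: int64 columns are numerical, everything else is object (text or category labels)
numerical_columns = [col for col in df.columns if df[col].dtype == 'int64']
categorical_columns = [col for col in df.columns if col not in numerical_columns]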

In [11]:
print(numerical_columns)
print(categorical_columns)

['Age', 'Rating', 'Recommended IND', 'Positive Feedback Count']
['Title', 'Review Text', 'Division Name', 'Department Name', 'Class Name']
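The dtype split above can also be expressed with `DataFrame.select_dtypes`, which selects by dtype family rather than comparing against the exact string `'int64'`. The sketch below is only an illustration on a small made-up frame (`toy` and its values are hypothetical, not the real dataset).

```python
import pandas as pd

# Hypothetical miniature of the reviews data, for illustration only.
toy = pd.DataFrame({
    "Age": [33, 34, 60],
    "Rating": [4, 5, 3],
    "Title": [None, None, "Some major design flaws"],
    "Class Name": ["Intimates", "Dresses", "Dresses"],
})

# "number" matches every numeric dtype (int64, float64, ...),
# so the split keeps working if a column is later converted to float.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
other_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```

One reason to prefer this form is that a column converted to `float64` would silently fall into the non-numeric list with the `== 'int64'` comparison.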


In [12]:
# Count how many reviews recommend the product (1) and how many do not (0)
print(df['Recommended IND'].value_counts())

Recommended IND
1    19314
0     4172
Name: count, dtype: int64
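The raw counts are easier to compare as shares. With the counts shown above, 19314 / 23486 is about 0.822, which matches the mean of `Recommended IND` reported by `describe()`. A minimal sketch of getting shares directly with `normalize=True`, using a made-up five-row series (`recommended` is hypothetical data):

```python
import pandas as pd

# Hypothetical indicator values: four "recommend" and one "do not recommend".
recommended = pd.Series([1, 1, 1, 0, 1], name="Recommended IND")

# normalize=True returns relative frequencies instead of counts.
shares = recommended.value_counts(normalize=True)
print(shares)

# For a 0/1 indicator, the share of 1s equals the column mean.
print(recommended.mean())
```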


<p style="background-color: #12f7ff; font-family: 'Trebuchet MS', sans-serif; color: #000; font-size: 150%; text-align: center; border-radius: 50px 15px; padding: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);">
    🔻Data Cleaning and Preprocessing 🔻
</p>


In [13]:

# Count missing values per column
missing_values = df.isnull().sum()

# Count fully duplicated rows (each repeat after the first occurrence)
duplicate_entries = df.duplicated().sum()

print(missing_values,'\n')
print(duplicate_entries)

Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
dtype: int64 

232
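`duplicated()` by default flags only the repeats, not the first occurrence, so the count above is the number of rows that `drop_duplicates()` would remove at this point. To look at the duplicated rows themselves, `keep=False` flags every copy. A small sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical rows: the first, second, and fourth are identical.
toy = pd.DataFrame({
    "Age": [30, 30, 41, 30],
    "Review Text": ["Great fit", "Great fit", "Too small", "Great fit"],
    "Rating": [5, 5, 2, 5],
})

# Default keep='first': only the 2nd and 4th rows are flagged.
n_flagged = toy.duplicated().sum()

# keep=False flags all three copies, which is handy for inspection.
all_copies = toy[toy.duplicated(keep=False)]

print(n_flagged)
print(all_copies)
```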


In [14]:
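# Create a derived variable: the number of words in each review.
# Note: str(x) turns a missing review (NaN) into the string 'nan', so missing reviews count as 1 word here.
df['Review Word Count'] = df['Review Text'].apply(lambda x: len(str(x).split()))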
df['Review Word Count']

0         8
1        62
2        98
3        22
4        36
         ..
23481    28
23482    38
23483    42
23484    86
23485    19
Name: Review Word Count, Length: 23486, dtype: int64
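One subtlety of the `str(x)` approach is how it treats missing reviews: `str(NaN)` is the string `'nan'`, which splits into one word. That is why `Review Word Count` shows no missing values even though `Review Text` has 845. The sketch below compares three options on a made-up series (`reviews` is hypothetical). Which behavior is preferable depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical reviews with one missing entry.
reviews = pd.Series(["Love this dress", np.nan, "Runs small"], name="Review Text")

# Approach used in the notebook: NaN becomes 'nan' and counts as 1 word.
via_str_cast = reviews.apply(lambda x: len(str(x).split()))

# Vectorized string accessor: missing reviews stay missing (NaN).
via_str_accessor = reviews.str.split().str.len()

# Treat a missing review as empty text: word count 0.
via_fill = reviews.fillna("").str.split().str.len()

print(via_str_cast.tolist())
print(via_str_accessor.tolist())
print(via_fill.tolist())
```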

In [15]:
df.isnull().sum()

Age                           0
Title                      3810
Review Text                 845
Rating                        0
Recommended IND               0
Positive Feedback Count       0
Division Name                14
Department Name              14
Class Name                   14
Review Word Count             0
dtype: int64

**The dataset has missing values in several columns:**

- Title: 3,810 missing values.
- Review Text: 845 missing values.
- Division Name, Department Name, Class Name: 14 missing values each.
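Relative to the 23,486 rows, these counts are roughly 16.2% for Title, 3.6% for Review Text, and 0.06% for each of the three category columns. A minimal sketch of computing such percentages, shown on a made-up frame (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    "Title": ["Nice", np.nan, np.nan, "Cute"],
    "Review Text": ["a", "b", np.nan, "d"],
    "Rating": [5, 4, 3, 5],
})

# isnull() gives booleans; their mean is the fraction of missing values.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)
```

Percentages make it easier to decide between imputing a column (small share missing) and dropping rows or columns.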

In [16]:
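# Handling missing values

# Replace missing 'Title' with 'No Title'.
# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not update df.
df['Title'] = df['Title'].fillna('No Title')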

In [17]:
# Remove rows where 'Review Text' is missing
df.dropna(subset=['Review Text'], inplace=True)

In [18]:
# Fill missing values in 'Division Name', 'Department Name', and 'Class Name' with 'Unknown'
df['Division Name'] = df['Division Name'].fillna('Unknown')
df['Department Name'] = df['Department Name'].fillna('Unknown')
df['Class Name'] = df['Class Name'].fillna('Unknown')
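The column-by-column fills can also be written as a single `fillna` call with a dictionary mapping column names to fill values. This is a sketch on made-up data (`toy` and `fill_values` are hypothetical names), not a change to the notebook's result.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in the text and category columns.
toy = pd.DataFrame({
    "Title": [np.nan, "My favorite buy!"],
    "Division Name": ["General", np.nan],
    "Class Name": [np.nan, "Pants"],
})

# One fill value per column; columns not listed are left untouched.
fill_values = {
    "Title": "No Title",
    "Division Name": "Unknown",
    "Class Name": "Unknown",
}
toy = toy.fillna(value=fill_values)
print(toy)
```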

In [19]:
# Remove duplicate rows. Judging by the final shape below, 2 rows are dropped here (22,641 -> 22,639).
df.drop_duplicates(inplace=True)

# Updated dataset inspection
updated_missing_values = df.isnull().sum()
updated_shape = df.shape

updated_missing_values, updated_shape


(Age                        0
 Title                      0
 Review Text                0
 Rating                     0
 Recommended IND            0
 Positive Feedback Count    0
 Division Name              0
 Department Name            0
 Class Name                 0
 Review Word Count          0
 dtype: int64,
 (22639, 10))
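The final shape can be reconciled with the earlier outputs by simple row accounting. The numbers in this sketch are copied from the outputs shown in this notebook. The variable names are made up for illustration.

```python
# Row counts copied from the notebook outputs above.
rows_loaded = 23486          # df.shape[0] right after loading
missing_review_text = 845    # missing 'Review Text' values
rows_final = 22639           # df.shape[0] after cleaning

# Rows left after dropping reviews with no text.
rows_after_dropna = rows_loaded - missing_review_text

# Whatever else disappeared was removed by drop_duplicates().
duplicates_removed = rows_after_dropna - rows_final

print(rows_after_dropna)   # 22641
print(duplicates_removed)  # 2
```

This suggests that most of the 232 rows flagged as duplicates before cleaning were rows without review text, which `dropna` had already removed.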

<p style="background-color: #12f7ff; font-family: 'Trebuchet MS', sans-serif; color: #000; font-size: 150%; text-align: center; border-radius: 50px 15px; padding: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19);">
    🔻Next Steps 🔻
</p>

#### We can now proceed with exploratory data analysis (EDA) to uncover trends, patterns, and insights from the dataset.
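As one possible first EDA step, ratings and recommendation rates could be summarized per department with `groupby` and named aggregations. The sketch below runs on a made-up frame (`toy` and its values are hypothetical) and only illustrates the shape of such a summary.

```python
import pandas as pd

# Hypothetical cleaned rows, for illustration only.
toy = pd.DataFrame({
    "Department Name": ["Dresses", "Dresses", "Tops", "Bottoms"],
    "Rating": [5, 3, 5, 4],
    "Recommended IND": [1, 0, 1, 1],
})

# Named aggregation: one output column per (source column, function) pair.
summary = toy.groupby("Department Name").agg(
    mean_rating=("Rating", "mean"),
    recommend_rate=("Recommended IND", "mean"),
    n_reviews=("Rating", "size"),
)
print(summary)
```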