
In [2]:
import pandas as pd

**Note:** This is a data cleaning and preprocessing exercise. Apply every preprocessing step and transformation to the same DataFrame, because that DataFrame is written to disk at the end of the exercise.

# Task 1
## Inspect the data
- Load the data
- Display the first 5 rows
- Display 5 random rows
- Check the number of rows and columns
- Inspect the data types of all features

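In [None]:
# In Colab, upload nykaa.csv from the local machine before reading it
from google.colab import files

uploaded = files.upload()

In [None]:
##### CODE HERE #####
df = pd.read_csv('nykaa.csv')

# A cell only renders its last expression, so print each result explicitly
print(df.head())     # first 5 rows
print(df.sample(5))  # 5 random rows
print(df.shape)      # (number of rows, number of columns)
print(df.dtypes)     # data type of each feature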

# Task 2
## Fix data types
- Ensure that all features are stored in the most suitable data type
- Convert the `'Product Price'` feature to `float64` data type

In [None]:
##### CODE HERE #####
df['Product Price'] = df['Product Price'].astype('float64')
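
If the price column is read in as text (for example with a currency symbol or thousands separators), `astype('float64')` raises a `ValueError`. The following is a minimal sketch of a more defensive conversion. The sample values and the cleaning pattern are assumptions for illustration and are not taken from `nykaa.csv`.

```python
import pandas as pd

# Hypothetical sample values; the real column may already be numeric
prices = pd.Series(['₹1,299', '450', 'N/A', '2000.50'], name='Product Price')

# Keep only digits and the decimal point, then coerce unparseable entries to NaN
cleaned = prices.str.replace(r'[^0-9.]', '', regex=True)
as_float = pd.to_numeric(cleaned, errors='coerce').astype('float64')

print(as_float.tolist())
```

Entries that cannot be parsed become `NaN`, so they are picked up by the missing-value step in Task 3.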

# Task 3
## Handle missing values
- Identify columns with missing values
- Replace missing values in string columns with `'nodata'`
- Replace missing numeric values with median

In [None]:
##### CODE HERE #####
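# Count missing values per column
print(df.isnull().sum())

# Text columns are often stored as object dtype rather than the string
# extension dtype, so select both; include='string' alone can match nothing
for col in df.select_dtypes(include=['object', 'string']).columns:
    df[col] = df[col].fillna('nodata')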

In [10]:
# include='number' covers every numeric dtype, not only int64 and float64
for col in df.select_dtypes(include='number').columns:
    df[col] = df[col].fillna(df[col].median())
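
One way to sanity-check the imputation logic is to run it on a tiny hand-made frame where the expected result is obvious. This is a sketch only: the frame `toy` and its values are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical frame with one missing value per column
toy = pd.DataFrame({
    'brand': ['A', None, 'C'],
    'price': [100.0, None, 300.0],
})

text_cols = toy.select_dtypes(include=['object', 'string']).columns
num_cols = toy.select_dtypes(include='number').columns

# Text columns get the placeholder, numeric columns get their own median
toy[text_cols] = toy[text_cols].fillna('nodata')
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

print(toy)
```

The median of 100 and 300 is 200, so the missing price becomes 200.0 and the missing brand becomes `'nodata'`.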

# Task 4
## Remove duplicate records
- Check whether duplicate rows exist
- Remove duplicate records
- Display how many rows were removed

In [13]:
##### CODE HERE #####
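# duplicated() flags every repeat after the first occurrence, so this count
# is also the number of rows that drop_duplicates() removes
print(df.duplicated().sum())
df = df.drop_duplicates()
# Confirm that no duplicates remain
print(df.duplicated().sum())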

11
0
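
An alternative way to report the number of removed rows is to compare the row count before and after dropping. The sketch below uses a hypothetical toy frame to show that both approaches give the same number.

```python
import pandas as pd

# Hypothetical frame: the value 1 appears twice and the value 2 three times
toy = pd.DataFrame({'id': [1, 1, 2, 2, 2], 'brand': ['a', 'a', 'b', 'b', 'b']})

flagged = int(toy.duplicated().sum())  # repeats after the first occurrence
rows_before = len(toy)
deduped = toy.drop_duplicates()
rows_removed = rows_before - len(deduped)

print(flagged, rows_removed)
```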


# Task 5
## Standardise column names
- Remove any leading and trailing whitespaces from column names
- Convert all column names to lowercase
- Replace spaces between words with underscores (`_`)
- Ensure there are no multiple consecutive underscores
- Remove any leading or trailing underscores
- Display the updated column names

In [14]:
##### CODE HERE #####
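df.columns = (
    df.columns
      .str.strip()                           # drop leading/trailing whitespace
      .str.lower()                           # lowercase
      .str.replace(' ', '_', regex=False)    # spaces to underscores
      .str.replace('_+', '_', regex=True)    # collapse repeated underscores
      .str.strip('_')                        # drop leading/trailing underscores
)

# Display the updated column names
print(df.columns.tolist())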
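
The chain of string operations above can be checked on a few deliberately messy names. The column names below are made up for illustration and are not the actual headers of `nykaa.csv`.

```python
import pandas as pd

# Hypothetical messy headers covering each rule in the task list
messy = pd.Index(['  Product Name ', 'Product  Price', '_Reviews Count_'])

clean = (
    messy
      .str.strip()
      .str.lower()
      .str.replace(' ', '_', regex=False)
      .str.replace(r'_+', '_', regex=True)
      .str.strip('_')
)

print(clean.tolist())
```

The double space in the second name first becomes a double underscore and is then collapsed to a single one.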

# Task 6
## Engineer new features
- Create a column `'price_category'`:
    - `'low'` if product price is less than 500
    - `'medium'` if product price is between 500 and 2000 (both inclusive)
    - `'high'` if product price is greater than 2000
- Create a column `'has_reviews'`:
    - `'yes'` if the product has at least one review
    - `'no'` otherwise
- Obtain the mode of the `'price_category'` feature

In [15]:
##### CODE HERE #####
def price_category(price):
    if price < 500:
        return 'low'
    elif price <= 2000:
        return 'medium'
    else:
        return 'high'

df['price_category'] = df['product_price'].apply(price_category)

In [16]:
df['has_reviews'] = df['product_reviews_count'].apply(lambda x: 'yes' if x > 0 else 'no')

In [17]:
df['price_category'].mode()[0]

'low'
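
As an alternative to `apply`, the same rules can be written with vectorised NumPy calls. This is a sketch on a hypothetical toy frame whose prices sit exactly on the category boundaries, so the inclusive and exclusive edges are easy to check. `np.select` evaluates the conditions in order, which is why the second condition only needs the upper bound.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen to hit the boundaries 500 and 2000
toy = pd.DataFrame({
    'product_price': [499.0, 500.0, 2000.0, 2000.5],
    'product_reviews_count': [0, 3, 1, 0],
})

conditions = [toy['product_price'] < 500, toy['product_price'] <= 2000]
toy['price_category'] = np.select(conditions, ['low', 'medium'], default='high')
toy['has_reviews'] = np.where(toy['product_reviews_count'] > 0, 'yes', 'no')

print(toy)
print(toy['price_category'].mode()[0])
```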

# Task 7
## Treat outliers
- Use the interquartile range (IQR) method to identify outliers in the `'product_price'` column
  - IQR = Q3 - Q1, UL = Q3 + 1.5 * IQR, LL = Q1 - 1.5 * IQR
- Find the number of outliers according to the IQR method and treat them by capping them to the UL or the LL accordingly

In [18]:
##### CODE HERE #####
Q1 = df['product_price'].quantile(0.25)
Q3 = df['product_price'].quantile(0.75)
IQR = Q3 - Q1

LL = Q1 - 1.5 * IQR
UL = Q3 + 1.5 * IQR

# Count outliers
outliers = df[(df['product_price'] < LL) | (df['product_price'] > UL)]
print(len(outliers))

# Cap them
df['product_price'] = df['product_price'].clip(lower=LL, upper=UL)


27
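
A small worked example makes the IQR arithmetic easier to follow. The six prices below are hypothetical. With pandas' default linear interpolation, Q1 = 12.5 and Q3 = 17.5, so IQR = 5, LL = 5 and UL = 25, and only the value 100 falls outside the limits.

```python
import pandas as pd

# Hypothetical prices with one obvious high outlier
prices = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0, 100.0])

q1 = prices.quantile(0.25)  # 12.5
q3 = prices.quantile(0.75)  # 17.5
iqr = q3 - q1               # 5.0

ll = q1 - 1.5 * iqr         # 5.0
ul = q3 + 1.5 * iqr         # 25.0

# Count values outside [LL, UL], then cap them to the nearest limit
n_outliers = int(((prices < ll) | (prices > ul)).sum())
capped = prices.clip(lower=ll, upper=ul)

print(n_outliers, capped.tolist())
```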


# Task 8
## Write cleaned dataset to disk

In [None]:
# index=False keeps the row index out of the saved file
df.to_csv('nykaa_eda.csv', index=False)
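
After saving, it can be worth reading the file back to confirm that it round-trips. The sketch below does this with a hypothetical toy frame and a temporary directory rather than the real `nykaa_eda.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for the real data
toy = pd.DataFrame({'product_name': ['a', 'b'], 'product_price': [100.0, 250.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_eda.csv')
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# With index=False the reloaded frame has the same columns and shape
same_shape = reloaded.shape == toy.shape
same_columns = list(reloaded.columns) == list(toy.columns)
print(same_shape, same_columns)
```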