
In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv('nykaa_eda.csv')
df.head()

# Task 1
## Inspect the data
- Load the data
- Study the shape of the data
- Check the data for data types and missing values
- Display top 5 rows
- Study the summary statistics of the data using `describe()` and include all data types in the result
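
The inspection steps above can be sketched on a small hypothetical frame. The name `toy` and its values are made up for illustration and do not come from `nykaa_eda.csv`; only the pandas calls carry over to the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not the Nykaa dataset)
toy = pd.DataFrame({
    "product_brand": ["A", "B", "A", None],
    "product_price": [100.0, 250.0, np.nan, 80.0],
})

n_rows, n_cols = toy.shape               # shape is (rows, columns)
missing = toy.isna().sum()               # missing values per column
summary = toy.describe(include="all")    # numeric and non-numeric columns together

print(n_rows, n_cols)
print(toy.dtypes)
print(missing)
print(summary)
```

With `include="all"`, the summary mixes numeric statistics (such as `mean`) with categorical ones (such as `unique`), and cells that do not apply to a column are shown as `NaN`.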

In [None]:
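##### CODE HERE #####
print(df.shape)              # (rows, columns)
df.info()                    # data types and non-null counts per column
print(df.isna().sum())       # missing values per column
display(df.head())           # top 5 rows (display is available in notebooks)
df.describe(include='all')   # summary statistics for all data types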

# Task 2
## Analyse product price distribution
- Plot the discrete histogram of `'product_price'` with 5 bins
- Look for trends in the distribution
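
To see what "5 bins" means before plotting, the bin edges and counts can be computed directly. This is a sketch on hypothetical prices (the `prices` array is made up); `np.histogram` is the routine `plt.hist` relies on for its default equal-width binning.

```python
import numpy as np

# Hypothetical prices, for illustration only (not the Nykaa dataset)
prices = np.array([100, 150, 200, 400, 450, 900, 1000])

# Five equal-width bins spanning the minimum to the maximum price
counts, edges = np.histogram(prices, bins=5)

print(counts)   # number of prices falling in each bin
print(edges)    # six edges delimit five bins
```

Empty bins between populated ones are one sign of a skewed or gappy distribution, which is the kind of trend the task asks you to look for.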

In [None]:
##### CODE HERE #####
# Histogram of product prices with 5 bins (missing prices are dropped first)
plt.figure()
plt.hist(df['product_price'].dropna(), bins=5)
plt.title("Product Price Distribution")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()

# Task 3
## Analyse product brand counts
- Obtain the counts of all the brands in the dataset
- Find the top 5 brands based on the count in descending order
- Create a copy of the data frame with only these identified brands
- Check the shape of the data subset that you created
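
The count, select, and subset steps above can be sketched on a hypothetical frame. The brand labels and prices below are made up, and the example keeps the top 2 brands instead of 5 only to stay small.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the Nykaa dataset)
toy = pd.DataFrame({
    "product_brand": ["A", "A", "A", "B", "B", "C", "D"],
    "product_price": [100, 120, 90, 300, 310, 50, 75],
})

# value_counts() returns counts sorted in descending order
brand_counts = toy["product_brand"].value_counts()

# Index labels of the two most frequent brands
top_brands = brand_counts.head(2).index

# Independent copy containing only rows of those brands
subset = toy[toy["product_brand"].isin(top_brands)].copy()

print(brand_counts)
print(subset.shape)
```

Calling `.copy()` keeps later column assignments on the subset from triggering chained-assignment warnings against the original frame.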

In [None]:
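##### CODE HERE #####
# Count products per brand (value_counts sorts in descending order)
brand_counts = df['product_brand'].value_counts()
# Names of the 5 most frequent brands
top_brands = brand_counts.head(5).index
print(brand_counts.head(5))
# Copy only the rows for these brands; rebinding df lets later cells use the subset
df = df[df['product_brand'].isin(top_brands)].copy()
df.shape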

**Note:** From this point onwards, please use the subset data that you created in the previous step

# Task 4
## Analyse price distribution by brand
- Obtain the mean product price of each brand

In [None]:
##### CODE HERE #####
# Mean product price for each brand
df.groupby('product_brand')['product_price'].mean()

# Task 5
## Analyse price distribution by product category
- Obtain the mean product price of each product category
- Sort the result in ascending order
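
A grouped mean followed by an ascending sort can be sketched as below. The category names and prices are hypothetical and chosen only so the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data, for illustration only (not the Nykaa dataset)
toy = pd.DataFrame({
    "product_category": ["Lipstick", "Kajal", "Lipstick", "Serum", "Kajal"],
    "product_price": [300, 100, 500, 900, 200],
})

# Mean price per category; sort_values() sorts ascending by default
mean_by_category = (
    toy.groupby("product_category")["product_price"]
       .mean()
       .sort_values()
)

print(mean_by_category)
```

The same pattern with `'product_brand'` as the grouping key gives the per-brand means.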

In [None]:
##### CODE HERE #####
# Mean product price for each category, sorted in ascending order
df.groupby('product_category')['product_price'].mean().sort_values()

# Task 6
## Analyse product ratings by presence of reviews
- Obtain the mean product ratings for products that have at least one review and those that have no reviews
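
One way to split products by presence of reviews is to derive a label column from the review count and group on it. The sketch below uses hypothetical values, and the column name `review_status` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only (not the Nykaa dataset)
toy = pd.DataFrame({
    "product_reviews_count": [0, 0, 5, 12],
    "product_rating": [3.0, 4.0, 4.0, 5.0],
})

# Label each product by whether it has at least one review
toy["review_status"] = np.where(
    toy["product_reviews_count"] > 0, "has_reviews", "no_reviews"
)

# Mean rating within each group
mean_rating = toy.groupby("review_status")["product_rating"].mean()

print(mean_rating)
```

Swapping `"product_rating"` for `"product_price"` gives the corresponding comparison of mean prices.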

In [None]:
##### CODE HERE #####
# True for products with at least one review, False for products with none
has_reviews = (df['product_reviews_count'] > 0).rename('has_reviews')
# Mean rating within each group
df.groupby(has_reviews)['product_rating'].mean()

# Task 7
## Analyse product prices by presence of reviews
- Obtain the mean product prices for products that have at least one review and those that have no reviews

In [None]:
##### CODE HERE #####
# True for products with at least one review, False for products with none
has_reviews = (df['product_reviews_count'] > 0).rename('has_reviews')
# Mean price within each group
df.groupby(has_reviews)['product_price'].mean()

**Note:** For tasks that require you to compute natural logarithms, please use the `log1p` function from `numpy`
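
As a quick illustration of why `log1p` is convenient here: `np.log1p(x)` computes `log(1 + x)`, so a count of zero maps to 0 instead of negative infinity. The `counts` array below is hypothetical.

```python
import numpy as np

# Hypothetical review counts, including a product with no reviews
counts = np.array([0, 1, 9, 99])

logged = np.log1p(counts)        # log(1 + x), defined at x = 0
recovered = np.expm1(logged)     # inverse transform: exp(y) - 1

# log1p also stays accurate for very small inputs, where 1 + x rounds to 1
tiny = 1e-20
naive = np.log(1 + tiny)         # precision is lost before the log is taken
accurate = np.log1p(tiny)

print(logged)
print(recovered)
print(naive, accurate)
```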

# Task 8
## Analyse relation between review counts and ratings
- Plot a scatter plot between `'product_reviews_count'` and `'product_rating'`
- Plot another scatter plot between natural logarithm of `'product_reviews_count'` and `'product_rating'`

In [None]:
##### CODE HERE #####
# Natural log of review counts; log1p(x) computes log(1 + x), so zero counts map to 0
df['log_reviews'] = np.log1p(df['product_reviews_count'])

# ------------------------------------------------
# 1️⃣ Scatter Plot: Reviews vs Rating
# ------------------------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='product_reviews_count',
    y='product_rating'
)
plt.title("Product Reviews Count vs Product Rating")
plt.xlabel("Product Reviews Count")
plt.ylabel("Product Rating")
plt.show()


# ------------------------------------------------
# 2️⃣ Scatter Plot: Log(Reviews) vs Rating
# ------------------------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='log_reviews',
    y='product_rating'
)
plt.title("Log(Product Reviews Count) vs Product Rating")
plt.xlabel("Log(Product Reviews Count)")
plt.ylabel("Product Rating")
plt.show()

# Task 9
## Analyse relation between product ratings and price
- Plot a scatter plot between `'product_rating'` and `'product_price'`
- Plot another scatter plot between `'product_rating'` and natural logarithm of `'product_price'`
- Plot a regression plot using `regplot` from `seaborn` between `'product_rating'` and natural logarithm of `'product_price'`
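
To read a regression plot numerically, a straight line can be fitted by least squares, which is comparable to the line `regplot` draws with its default settings. This is a sketch on hypothetical prices and ratings, with rating on the vertical axis as an assumption; `np.polyfit` stands in for seaborn's internal fit.

```python
import numpy as np

# Hypothetical prices and ratings, for illustration only
price = np.array([99, 199, 399, 799])
rating = np.array([3.8, 4.0, 4.1, 4.4])

# Log-transform price as the note on logarithms recommends
log_price = np.log1p(price)

# Degree-1 least-squares fit: rating is approximately slope * log_price + intercept
slope, intercept = np.polyfit(log_price, rating, deg=1)
fitted = slope * log_price + intercept

print(slope, intercept)
print(rating - fitted)   # residuals of the fit
```

A positive slope means ratings tend to rise with log price in this toy sample; the sign and size of the slope on the real data is what the regression plot lets you judge.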

In [None]:
##### CODE HERE #####
# Natural log of product price, computed with log1p as the note above requires
df['log_price'] = np.log1p(df['product_price'])

# ------------------------------------------------
# 1️⃣ Scatter Plot: Rating vs Price
# ------------------------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='product_price',
    y='product_rating'
)
plt.title("Product Rating vs Product Price")
plt.xlabel("Product Price")
plt.ylabel("Product Rating")
plt.show()


# ------------------------------------------------
# 2️⃣ Scatter Plot: Rating vs Log(Price)
# ------------------------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='log_price',
    y='product_rating'
)
plt.title("Product Rating vs Log(Product Price)")
plt.xlabel("Log(Product Price)")
plt.ylabel("Product Rating")
plt.show()


# ------------------------------------------------
# 3️⃣ Regression Plot: Rating vs Log(Price)
# ------------------------------------------------
plt.figure()
sns.regplot(
    data=df,
    x='log_price',
    y='product_rating'
)
plt.title("Regression: Rating vs Log(Product Price)")
plt.xlabel("Log(Product Price)")
plt.ylabel("Product Rating")
plt.show()


# Task 10
## Analyse relation between review counts, ratings, and brand
- Plot a scatter plot between natural logarithm of `'product_reviews_count'` and `'product_rating'` and set hue as `'product_brand'`
- Plot a regression plot using `regplot` from `seaborn` between natural logarithm of `'product_reviews_count'` and `'product_rating'` only for the brand `'LIME CRIME'`
- Plot another regression plot using `regplot` from `seaborn` between natural logarithm of `'product_reviews_count'` and `'product_rating'` only for the brand `'LAKME'`

In [None]:
##### CODE HERE #####
# Natural log of review counts; log1p(x) computes log(1 + x), so zero counts map to 0
df['log_reviews'] = np.log1p(df['product_reviews_count'])

# -------------------------------
# 1️⃣ Scatter Plot (All Brands)
# -------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='log_reviews',
    y='product_rating',
    hue='product_brand'
)
plt.title("Log(Review Count) vs Product Rating (All Brands)")
plt.show()


# -------------------------------
# 2️⃣ Regression Plot - LIME CRIME
# -------------------------------
plt.figure()
sns.regplot(
    data=df[df['product_brand'] == 'LIME CRIME'],
    x='log_reviews',
    y='product_rating'
)
plt.title("LIME CRIME: Log(Review Count) vs Rating")
plt.show()


# -------------------------------
# 3️⃣ Regression Plot - LAKME
# -------------------------------
plt.figure()
sns.regplot(
    data=df[df['product_brand'] == 'LAKME'],
    x='log_reviews',
    y='product_rating'
)
plt.title("LAKME: Log(Review Count) vs Rating")
plt.show()

# Task 11
## Analyse relation between ratings, price, and brand
- Plot a scatter plot between `'product_rating'` and `'product_price'` and set hue as `'product_brand'`
- Plot a scatter plot between `'product_rating'` and natural logarithm of `'product_price'` and set hue as `'product_brand'`
- Plot a regression plot using `regplot` from `seaborn` between `'product_rating'` and natural logarithm of `'product_price'` only for the brand `'HIMALAYA'`
- Plot a regression plot using `regplot` from `seaborn` between `'product_rating'` and natural logarithm of `'product_price'` only for the brand `'LIME CRIME'`

In [None]:
##### CODE HERE #####
# Natural log of product price, computed with log1p as the earlier note requires
df['log_price'] = np.log1p(df['product_price'])

# ------------------------------------------------
# 1️⃣ Scatter Plot: Rating vs Price (All Brands)
# ------------------------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='product_price',
    y='product_rating',
    hue='product_brand'
)
plt.title("Product Rating vs Product Price")
plt.show()


# ------------------------------------------------
# 2️⃣ Scatter Plot: Rating vs Log(Price)
# ------------------------------------------------
plt.figure()
sns.scatterplot(
    data=df,
    x='log_price',
    y='product_rating',
    hue='product_brand'
)
plt.title("Product Rating vs Log(Product Price)")
plt.show()


# ------------------------------------------------
# 3️⃣ Regression Plot: HIMALAYA
# ------------------------------------------------
plt.figure()
sns.regplot(
    data=df[df['product_brand'] == 'HIMALAYA'],
    x='log_price',
    y='product_rating'
)
plt.title("HIMALAYA: Rating vs Log(Price)")
plt.show()


# ------------------------------------------------
# 4️⃣ Regression Plot: LIME CRIME
# ------------------------------------------------
plt.figure()
sns.regplot(
    data=df[df['product_brand'] == 'LIME CRIME'],
    x='log_price',
    y='product_rating'
)
plt.title("LIME CRIME: Rating vs Log(Price)")
plt.show()