<a href="https://colab.research.google.com/github/Anjali-Narwaria/Cleaned-dataset-Task1-/blob/main/Task_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [16]:
# Load the dataset with pandas
# (the path assumes Google Drive is already mounted at /content/drive)
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/amazon.csv')

In [39]:
# Identify and handle missing values
print("Missing values before handling:")
print(df.isnull().sum())  # Count missing values per column

# Fill missing numerical values with the column mean and missing text values with the mode.
# The result is assigned back to df[column]: calling fillna(..., inplace=True) on
# df[column] acts on an intermediate copy and will stop updating df in pandas 3.0.
for column in df.columns:
    if df[column].dtype == 'object':
        # Fill missing text values with the mode ('Unknown' if the column has no mode)
        modes = df[column].mode()
        mode_value = modes.iloc[0] if not modes.empty else 'Unknown'
        df[column] = df[column].fillna(mode_value)
    elif df[column].dtype in ['int64', 'float64']:
        # Fill missing numerical values with the mean
        df[column] = df[column].fillna(df[column].mean())

print("\nMissing values after handling:")
print(df.isnull().sum())

Missing values before handling:
product_id             0
product_name           0
category               0
discounted_price       0
actual_price           0
discount_percentage    0
rating                 0
rating_count           0
about_product          0
user_id                0
user_name              0
review_id              0
review_title           0
review_content         0
img_link               0
product_link           0
dtype: int64

Missing values after handling:
product_id             0
product_name           0
category               0
discounted_price       0
actual_price           0
discount_percentage    0
rating                 0
rating_count           0
about_product          0
user_id                0
user_name              0
review_id              0
review_title           0
review_content         0
img_link               0
product_link           0
dtype: int64
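
Because the dataset reported no missing values in this run, the fill logic never had anything to do. The sketch below exercises the same idea on a small made-up frame (the `toy` data is hypothetical, not taken from `amazon.csv`) so the effect of mean and mode filling is visible. It is an illustration under those assumptions, not the notebook's actual result.

```python
import pandas as pd

# Hypothetical toy data with one gap in each column
toy = pd.DataFrame({
    "rating": [4.0, None, 5.0],
    "category": ["electronics", None, "electronics"],
})

for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        # Numeric column: fill gaps with the mean of the observed values
        toy[column] = toy[column].fillna(toy[column].mean())
    else:
        # Text column: fill gaps with the most frequent value, if one exists
        modes = toy[column].mode()
        fill_value = modes.iloc[0] if not modes.empty else "Unknown"
        toy[column] = toy[column].fillna(fill_value)

print(toy)
```

Assigning the filled column back to `toy[column]` avoids the chained `inplace=True` pattern entirely.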




In [40]:
# Remove duplicate rows
initial_rows = len(df)
df = df.drop_duplicates()
rows_after_dropping_duplicates = len(df)
print(f"\nRemoved {initial_rows - rows_after_dropping_duplicates} duplicate rows.")



Removed 0 duplicate rows.
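
`drop_duplicates()` with no arguments only removes rows that match in every column, which may be why it found nothing here. As a rough sketch, the example below contrasts exact duplicates with duplicates on a key column. The toy frame and the choice of `product_id` as the key are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical, row 2 repeats only the product
toy = pd.DataFrame({
    "product_id": ["A1", "A1", "A1", "B2"],
    "review_id": ["r1", "r1", "r2", "r9"],
})

# Rows that repeat an earlier row in every column
exact_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier row in the key column only
key_dupes = int(toy.duplicated(subset=["product_id"]).sum())

# Drop exact duplicates and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"exact duplicates: {exact_dupes}, duplicates on product_id: {key_dupes}")
print(deduped)
```

Whether key-based duplicates should be dropped depends on the analysis; several reviews for one product can be legitimate rows.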


In [41]:
# Standardize text values (example for 'category' and 'product_name')
for col in ['category', 'product_name']:
    if col in df.columns and df[col].dtype == 'object':
        df[col] = df[col].str.strip().str.lower()
        print(f"\nStandardized text in column: {col}")


Standardized text in column: category

Standardized text in column: product_name
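
To show what trimming and lowercasing buys, here is a minimal sketch on made-up values (the `toy` frame is hypothetical). Three spellings of the same category collapse into one after standardization.

```python
import pandas as pd

# Hypothetical toy data: the same category written three different ways
toy = pd.DataFrame({"category": ["  Electronics ", "electronics", "ELECTRONICS"]})

distinct_before = toy["category"].nunique()

# Trim surrounding whitespace, then lowercase
toy["category"] = toy["category"].str.strip().str.lower()

distinct_after = toy["category"].nunique()

print(f"distinct categories: {distinct_before} -> {distinct_after}")
```

One consequence worth keeping in mind: rows that differed only in case or spacing become identical after this step, so running `drop_duplicates()` again afterwards can catch duplicates that an earlier pass missed.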


In [44]:
# Rename column headers to be clean and uniform (lowercase and underscores, no spaces)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
print("\nCleaned column headers.")


Cleaned column headers.
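
The headers in this dataset were already lowercase with underscores, so the rename had no visible effect. The sketch below applies the same chain to made-up headers and compares it with a regex variant that collapses runs of whitespace. The header names and the regex variant are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Hypothetical messy headers, including a run of three spaces
toy = pd.DataFrame(columns=[" Product ID", "Actual   Price ", "rating"])

# Same chain as the notebook cell: each single space becomes one underscore
simple = list(toy.columns.str.strip().str.lower().str.replace(" ", "_"))

# Variant: treat any run of whitespace as a single separator
robust = list(
    toy.columns.str.strip().str.lower().str.replace(r"\s+", "_", regex=True)
)

print(simple)
print(robust)
```

The literal replacement is enough when headers contain single spaces; the regex form is one option when spacing is irregular.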


In [47]:
# Check and fix data types
print("\nData types before fixing:")
print(df.dtypes)


# Example for potential price columns:
for col in ['discounted_price', 'actual_price']:
    if col in df.columns and df[col].dtype == 'object':
        # Remove currency symbols and commas, then convert to numeric
        df[col] = df[col].astype(str).str.replace('₹', '').str.replace(',', '')
        df[col] = pd.to_numeric(df[col], errors='coerce')
        print(f"Converted '{col}' to numeric.")

# Example for potential rating column:
if 'rating' in df.columns and df['rating'].dtype == 'object':
    # Convert rating to numeric, coercing errors will turn non-numeric into NaN
    df['rating'] = pd.to_numeric(df['rating'], errors='coerce')
    print("Converted 'rating' to numeric.")

print("\nData types after fixing:")
print(df.dtypes)

# View cleaned DataFrame info
# (df.info() prints its report directly and returns None, so it is not wrapped in print)
print("\nCleaned DataFrame Info:")
df.info()


Data types before fixing:
product_id              object
product_name            object
category                object
discounted_price       float64
actual_price           float64
discount_percentage     object
rating                 float64
rating_count            object
about_product           object
user_id                 object
user_name               object
review_id               object
review_title            object
review_content          object
img_link                object
product_link            object
dtype: object

Data types after fixing:
product_id              object
product_name            object
category                object
discounted_price       float64
actual_price           float64
discount_percentage     object
rating                 float64
rating_count            object
about_product           object
user_id                 object
user_name               object
review_id               object
review_title            object
review_content          object
img
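
The dtype listing above still shows `discount_percentage` and `rating_count` as `object`, and the cell does not convert them. A minimal sketch of one way to do so follows. It assumes the values look like `"64%"` and `"24,269"`, which the notebook never displays, so the toy values here are hypothetical and the real column contents should be inspected first.

```python
import pandas as pd

# Hypothetical toy values; the real column formats are an assumption
toy = pd.DataFrame({
    "discount_percentage": ["64%", "43%", "5%"],
    "rating_count": ["24,269", "1,204", "bad"],
})

# Strip the trailing percent sign, then convert to a number
toy["discount_percentage"] = pd.to_numeric(
    toy["discount_percentage"].str.rstrip("%"), errors="coerce"
)

# Remove thousands separators, then convert; unparseable values become NaN
toy["rating_count"] = pd.to_numeric(
    toy["rating_count"].str.replace(",", "", regex=False), errors="coerce"
)

print(toy)
print(toy.dtypes)
```

With `errors="coerce"`, anything that cannot be parsed turns into `NaN`, so a missing-value check after the conversion shows how many entries failed.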

# **Summary**
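I cleaned the Amazon Sales Dataset in five steps. First, I checked for missing values and set up imputation (the column mean for numeric columns, the most common value for text columns); in the run shown, all 16 columns reported 0 missing values, so nothing needed filling. Second, I checked for duplicate rows; none were found, so no rows were dropped. Third, I made the text in 'category' and 'product_name' consistent by trimming surrounding spaces and converting to lowercase. Fourth, I normalized the column headers to lowercase with underscores for easier coding. Fifth, I checked the data types: 'discounted_price', 'actual_price', and 'rating' were already float64 when the cell ran, so the conversion code left them unchanged, while 'discount_percentage' and 'rating_count' are still stored as object. Along the way, I fixed a few minor coding mistakes.

The dataset is now consistently formatted. Once 'discount_percentage' and 'rating_count' are converted to numeric types, it will be ready for analysis.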