# Assignment 4: Data Preprocessing using Python (Pandas)

In this assignment, we perform various data manipulation operations
on the Facebook metrics dataset and the Amazon book reviews dataset.

Libraries Used:
- pandas → Data manipulation
- numpy → Numerical operations


In [None]:
import pandas as pd
import numpy as np

## Loading the Datasets

The Facebook dataset uses ';' as the separator, so it must be passed via `sep`.
The Amazon dataset uses ',' as the separator, which is the `read_csv` default.

In [None]:
facebook_df = pd.read_csv("dataset_Facebook.csv", sep=';')
amazon_df = pd.read_csv("amazon_book_reviews.csv")

facebook_df.head()

In [None]:
amazon_df.head()
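
To see why the separator matters, here is a minimal sketch that reads a made-up semicolon-separated sample from memory. The sample text and the names `raw`, `wrong`, and `right` are hypothetical and are not taken from the actual dataset files.

```python
import io
import pandas as pd

# Hypothetical two-row sample in semicolon-separated style (not real data).
raw = "Type;like;share\nPhoto;10;2\nStatus;30;5\n"

# With the default separator (','), each line is read as one single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';', the three columns are parsed as intended.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

If a freshly loaded DataFrame shows only one column, a wrong separator is a likely cause.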

### Using loc[] – Label-Based Indexing
Select the rows labelled 0 through 3 (both ends included) and the columns 'like' and 'share'.

In [None]:
facebook_df.loc[0:3, ['like', 'share']]

### Using iloc[] – Position-Based Indexing
Select the first 4 rows and first 3 columns (the end of each slice is excluded).

In [None]:
facebook_df.iloc[0:4, 0:3]
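
The two indexers treat slice endpoints differently, which is easy to miss. The sketch below uses a small made-up frame (`df` and its values are hypothetical) to compare the same slice `0:3` under each indexer.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
df = pd.DataFrame(
    {
        "Type": ["Photo", "Status", "Link", "Video", "Photo"],
        "like": [10, 30, 20, 50, 40],
        "share": [2, 5, 3, 9, 7],
    }
)

# Label-based: the slice 0:3 includes the end label, so 4 rows come back.
by_label = df.loc[0:3, ["like", "share"]]

# Position-based: the slice 0:3 excludes the end position, so 3 rows come back.
by_position = df.iloc[0:3, 0:2]

print(len(by_label))     # 4
print(len(by_position))  # 3
```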

### Conditional Subset
Select posts with more than 200 likes, using a boolean mask.

In [None]:
facebook_df[facebook_df['like'] > 200]
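
Conditions can also be combined. The following sketch, on a made-up frame with hypothetical values and thresholds, filters on one condition and then on two joined with `&`. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
df = pd.DataFrame(
    {
        "Type": ["Photo", "Status", "Link", "Video", "Photo"],
        "like": [10, 30, 20, 50, 40],
    }
)

# A single condition produces a boolean Series that selects matching rows.
popular = df[df["like"] > 15]

# Two conditions combined with & (element-wise AND); note the parentheses.
popular_photos = df[(df["like"] > 15) & (df["Type"] == "Photo")]

print(len(popular))         # 4
print(len(popular_photos))  # 1
```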

## Merging Facebook and Amazon datasets

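The two datasets have no natural common key, so here we simulate a merge.
We assume 'Category' in the Facebook data relates to 'Rating' in the Amazon data;
this pairing is artificial and serves only to demonstrate the pd.merge() syntax.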

In [None]:
merged_df = pd.merge(
    facebook_df,
    amazon_df,
    left_on='Category',
    right_on='Rating',
    how='inner'
)

merged_df.head()
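
An inner merge keeps only keys present on both sides, and repeated keys multiply rows. The sketch below illustrates this on two tiny made-up frames; `left`, `right`, and their values (including the `title` column) are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for the two datasets; all values are made up.
left = pd.DataFrame({"Category": [1, 2, 2, 3], "like": [10, 30, 20, 50]})
right = pd.DataFrame({"Rating": [2, 2, 5], "title": ["Book A", "Book B", "Book C"]})

merged = pd.merge(
    left,
    right,
    left_on="Category",
    right_on="Rating",
    how="inner",
)

# Only key value 2 exists on both sides: 2 left rows x 2 right rows = 4 rows.
print(len(merged))           # 4
print(list(merged.columns))  # ['Category', 'like', 'Rating', 'title']
```

Because the key appears many times in each real dataset, the merged result can be much larger than either input, so checking `merged_df.shape` after merging is a sensible precaution.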

## Sorting Data

Sort Facebook posts based on number of likes in descending order.

In [None]:
facebook_df.sort_values(by='like', ascending=False)
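
Note that `sort_values()` returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch on made-up data (`df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
df = pd.DataFrame({"Type": ["Photo", "Status", "Link"], "like": [10, 30, 20]})

# The sorted result is a new object; df keeps its original order.
sorted_df = df.sort_values(by="like", ascending=False)

print(sorted_df["like"].tolist())  # [30, 20, 10]
print(df["like"].tolist())         # [10, 30, 20]

# Rows keep their original index labels; reset them if a clean 0..n-1 index is wanted.
sorted_reset = sorted_df.reset_index(drop=True)
print(sorted_df.index.tolist())     # [1, 2, 0]
print(sorted_reset.index.tolist())  # [0, 1, 2]
```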

## Transposing Data

Transpose swaps rows and columns.

In [None]:
facebook_df.T
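
As a small illustration of what `.T` does to the shape and the labels, here is a sketch on a made-up frame (`df` and its values are hypothetical). The old column names become the new row index, and the old row labels become the new column names.

```python
import pandas as pd

# Hypothetical toy frame: 3 rows, 2 columns; the values are made up.
df = pd.DataFrame({"Type": ["Photo", "Status", "Link"], "like": [10, 30, 20]})

transposed = df.T

print(df.shape)                    # (3, 2)
print(transposed.shape)            # (2, 3)
print(transposed.index.tolist())   # ['Type', 'like']
print(transposed.columns.tolist()) # [0, 1, 2]
```

When the original columns hold different types, as here with text and numbers, each transposed column mixes them, so numeric operations on the transposed frame may need an explicit conversion first.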

## Melting Data

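Convert the data from wide format to long format: the 'like' and 'share' columns
are stacked into rows, with 'Metric' naming the original column and 'Value' holding its value.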

In [None]:
melted_df = pd.melt(
    facebook_df,
    id_vars=['Type'],
    value_vars=['like', 'share'],
    var_name='Metric',
    value_name='Value'
)

melted_df.head()
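
The effect of melting is easiest to see on a tiny frame. This sketch uses made-up data (`wide`, `long_df`, and the values are hypothetical) with the same arguments as the cell above.

```python
import pandas as pd

# Hypothetical wide frame: one row per post type, one column per metric.
wide = pd.DataFrame({"Type": ["Photo", "Status"], "like": [10, 30], "share": [2, 5]})

long_df = pd.melt(
    wide,
    id_vars=["Type"],
    value_vars=["like", "share"],
    var_name="Metric",
    value_name="Value",
)

# 2 rows x 2 metrics become 4 rows, one per (Type, Metric) pair.
print(long_df.shape)               # (4, 3)
print(long_df["Metric"].tolist())  # ['like', 'like', 'share', 'share']
print(long_df["Value"].tolist())   # [10, 30, 2, 5]
```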

## Pivot (Long → Wide)

Reshape melted data back to wide format.

## pivot() vs pivot_table()

The pivot() function raises a ValueError when duplicate index-column combinations exist.

Since multiple posts share the same Type, each (Type, Metric) pair occurs more than once,
so we must use pivot_table(), which aggregates the duplicates.

Here we use mean to calculate average likes and shares per Type.

In [17]:
pivot_df = melted_df.pivot_table(
    index='Type',
    columns='Metric',
    values='Value',
    aggfunc='mean'   # or 'sum'
)

pivot_df

| Type | like | share |
|---|---|---|
| Link | 300.0 | 60.0 |
| Photo | 103.333333 | 18.333333 |
| Status | 200.0 | 50.0 |
| Video | 350.0 | 75.0 |
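
The difference between the two functions can be reproduced on a few rows. In this sketch the frame `long_df` and its values are made up; 'Photo' appears twice per metric, which is what makes `pivot()` fail.

```python
import pandas as pd

# Hypothetical long-format data with duplicate (Type, Metric) pairs.
long_df = pd.DataFrame(
    {
        "Type": ["Photo", "Photo", "Status", "Photo", "Photo", "Status"],
        "Metric": ["like", "like", "like", "share", "share", "share"],
        "Value": [10, 30, 20, 2, 4, 5],
    }
)

# pivot() cannot decide which of the duplicate values to keep, so it raises.
try:
    long_df.pivot(index="Type", columns="Metric", values="Value")
    pivot_failed = False
except ValueError as exc:
    pivot_failed = True
    print("pivot() failed:", exc)

# pivot_table() resolves duplicates by aggregating them (here: the mean).
wide = long_df.pivot_table(
    index="Type", columns="Metric", values="Value", aggfunc="mean"
)

print(wide.loc["Photo", "like"])   # 20.0
print(wide.loc["Photo", "share"])  # 3.0
```

Choosing `aggfunc='sum'` instead would give totals per Type rather than averages; which one is appropriate depends on the question being asked.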


## Conclusion

In this assignment we learned:

- Creating subsets using loc[], iloc[], and boolean conditions
- Merging datasets using pd.merge()
- Sorting data using sort_values()
- Transposing data using .T
- Melting data using pd.melt()
- Casting data back to wide format using pivot_table(), since pivot() fails on duplicate entries

These preprocessing techniques are essential in data analytics
for cleaning, transforming, and preparing data before analysis.