**Assignment Submission Guidelines**

**1. Submission Platform:**

- Submit your completed assignment through [Specify Platform: e.g., Google Classroom, Canvas, GitHub Classroom, etc.].

**2. Submission Format:**

- Submit the Google Colab Notebook (.ipynb file) provided as the assignment template.
- Do not create a new notebook. Fill in the provided template.

**3. Template Completion:**

The template notebook contains:
- The code to generate the social_media_posts.csv dataset.
- Placeholders for your code and explanations for each question.

Follow the instructions within the template.
- Code Cells:
  - Place your code solutions directly in the designated code cells below each question.
- Markdown Cells:
  - Provide your explanations and justifications in the designated Markdown cells.
- Report section:
  - Complete the markdown section at the bottom of the notebook titled "Report".
  - In this section, compile the explanation of each of the questions.
  - Answer the following data analysis questions:
    1. What are the key characteristics of the user base and their posts in this dataset?
    2. Which factors appear to have the strongest influence on post engagement (likes, retweets)?
    3. What are the most common missing data patterns, and what implications might they have on our analysis?
    4. Based on your analysis, what are 2-3 recommendations you would make to improve post engagement or user reach?
- Do not modify the structure of the template notebook.

**4. File Naming:**

Ensure the file name remains as provided in the template. Do not rename the file.

**5. Timely Submission:**

- Submit your completed template notebook by the deadline: **24th of March, 2025**.
- Late submissions will be penalized as follows:
  - Submissions received by **5:00pm on 26th of March, 2025** will receive a maximum of 5 marks for timely submission.
  - Submissions received after that will receive 0 marks for timely submission.

**6. Report:**

- Complete the "Report" section at the end of your notebook.
- Ensure your report is:
  - Well-organized and easy to read.
  - Clear and concise.
  - Free of grammatical errors.

**7. Code Execution:**

Ensure your completed notebook runs without errors from top to bottom.
Before submitting, restart the kernel and run all cells to confirm reproducibility.



**8. Academic Integrity:**

All work must be your own.
Plagiarism will result in a failing grade.
Cite any external resources you use.



**Tips for Success:**

- Start the assignment early.
- Read the instructions within the template carefully.
- Plan your approach before coding.
- Test your code thoroughly.
- Document your work clearly.
- Review the rubrics to understand the grading criteria.


**Grading Rubrics:**

Total 50 Marks

- Timely Submission: 10 Marks
- Report: 10 Marks
- Level 1 (Basic Questions): 5 Marks (1 x 5 = 5)
- Level 2 (Intermediate Questions): 10 Marks (2 x 5 = 10)
- Level 3 (Advanced Questions): 15 Marks (3 x 5 = 15)

## **Assignment Title: Unveiling Social Media Trends - A Data Exploration**

**Background**

You are a data analyst working for "SocialPulse," a social media analytics firm. SocialPulse helps businesses and organizations understand their online presence, track trends, and engage with their audience effectively.

Your team has collected a dataset of posts (tweets) from a social media platform. This dataset includes information on post content, user demographics, engagement metrics, and sentiment.

Your goal is to explore and analyze this data to uncover key trends and patterns that can help clients:

- Understand audience engagement and sentiment.
- Identify trending topics and influential users.
- Optimize social media content and marketing strategies.
- Track the impact of campaigns and events.

In this assignment, you will explore and analyze the social media posts dataset. If you choose to tackle the advanced level, you will delve deeper by building predictive models to understand the key drivers of post engagement and provide recommendations for enhancing future data collection.

**Dataset (Synthetic):**

We will create a synthetic dataset.

In [None]:
import pandas as pd
import numpy as np
import random
from datetime import datetime, timedelta

def generate_social_media_data(num_posts=1000):
    """Generates synthetic social media data."""

    data = []
    for post_id in range(1, num_posts + 1):
        user_id = random.randint(1001, 2000)
        post_content = "This is a synthetic post. #example"
        post_date = datetime(2023, 1, 1) + timedelta(days=random.randint(0, 364))
        user_age = random.randint(18, 60)
        user_gender = random.choice(['Male', 'Female', 'Other'])
        user_location = random.choice(['Urban', 'Suburban', 'Rural'])
        likes = random.randint(0, 500)
        retweets = random.randint(0, 200)
        hashtags = random.choice(['#example', '#trending', '#news', '#tech', '#social', np.nan])
        sentiment = random.choice(['Positive', 'Negative', 'Neutral', np.nan])
        verified_user = random.choice([True, False])

        data.append({
            'PostID': post_id,
            'UserID': user_id,
            'PostContent': post_content,
            'PostDate': post_date,
            'UserAge': user_age,
            'UserGender': user_gender,
            'UserLocation': user_location,
            'Likes': likes,
            'Retweets': retweets,
            'Hashtags': hashtags,
            'Sentiment': sentiment,
            'VerifiedUser': verified_user
        })

    df = pd.DataFrame(data)
    df['PostDate'] = pd.to_datetime(df['PostDate'])
    return df

# Generate and save the dataset
social_media_df = generate_social_media_data()
social_media_df.to_csv('social_media_posts.csv', index=False)

print("Synthetic social media posts dataset generated: social_media_posts.csv")

Synthetic social media posts dataset generated: social_media_posts.csv
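
The generator above draws from Python's `random` module without setting a seed, so each run produces a different dataset. The sketch below shows, on a small scale, how a seed makes the draws repeatable. The helper name `draw_likes` and the seed value 42 are illustrative assumptions and are not part of the template.

```python
import random

def draw_likes(n, seed=None):
    """Draw n like counts in [0, 500], optionally from a seeded generator."""
    rng = random.Random(seed)  # independent generator; seed=None means unseeded
    return [rng.randint(0, 500) for _ in range(n)]

# The same seed gives identical draws on every run
first = draw_likes(5, seed=42)
second = draw_likes(5, seed=42)
print(first == second)
```

For code that calls the module-level functions directly, as the generator does, the equivalent would be a single `random.seed(...)` call before `generate_social_media_data()`.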


**The Data**

The data comes from a compilation by SocialPulse, available in 'social_media_posts.csv'. Each row represents a single social media post:

**PostID** - Unique identifier for each post.

**UserID** - Unique identifier for the user who made the post.

**PostContent** - Content of the post.

**PostDate** - Date and time of the post.

**UserAge** - Age of the user.

**UserGender** - Gender of the user:

    - Male
    - Female
    - Other

**UserLocation** - Location of the user:

    - Urban
    - Suburban
    - Rural

**Likes** - Number of likes the post received.

**Retweets** - Number of retweets the post received.

**Hashtags** - Hashtags used in the post (or NaN if none).

    - #example
    - #trending
    - #news
    - #tech
    - #social
    - NaN (if none)

**Sentiment** - Sentiment of the post:

    - Positive
    - Negative
    - Neutral
    - NaN (if missing)

**VerifiedUser** - Indicates whether the user is verified.

    - True
    - False
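
To see how these columns typically look once the CSV is read back into pandas, here is a minimal sketch on two made-up rows in the same layout. The toy values and the variable names `toy_csv` and `posts` are assumptions for illustration; the real file has 1000 rows and all twelve columns.

```python
import io
import pandas as pd

# Two toy rows in the same layout as social_media_posts.csv (values are made up)
toy_csv = io.StringIO(
    "PostID,PostDate,Likes,Hashtags,Sentiment,VerifiedUser\n"
    "1,2023-01-05,120,#tech,Positive,True\n"
    "2,2023-03-17,40,,,False\n"
)

posts = pd.read_csv(toy_csv)

print(posts.head())   # empty CSV fields come back as NaN
print(posts.dtypes)   # PostDate is read as text unless it is parsed explicitly
```

The point to notice is that a CSV round trip does not preserve the datetime type, and that missing hashtags or sentiments are stored as empty fields.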

## **Basic (RBT Levels: 2, 3):**

Total: 5 Marks

Each Question Carries 1 Mark

**Question 1: Missing Value Identification**

Identify the columns in the social media dataset that contain missing values. How many missing values are present in each column?

In [None]:
# Question 1: Missing Value Identification
# Identify the columns in the social media dataset that contain missing values. How many missing values are present in each column?

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 2: Basic Missing Value Handling**

Remove all rows from the social media dataset that contain at least one missing value. How many rows are removed? Explain why you chose this approach.


In [None]:
# Question 2: Basic Missing Value Handling
# Remove all rows from the social media dataset that contain at least one missing value. How many rows are removed? Explain why you chose this approach.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 3: Data Type Conversion**

Verify the data types of each column in the social media dataset. Convert the 'Likes' and 'Retweets' columns to integer data types and the 'PostDate' column to a datetime data type. Explain why these data types are appropriate.


In [None]:
# Question 3: Data Type Conversion
# Verify the data types of each column. Convert the 'Likes' and 'Retweets' columns to integer data types and the 'PostDate' column to a datetime data type. Explain why these data types are appropriate.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 4: Renaming Columns**

Rename the 'PostID' column to 'Tweet_ID' and the 'UserID' column to 'User_ID'. Explain why renaming columns can be useful.


In [None]:
# Question 4: Renaming Columns
# Rename the 'PostID' column to 'Tweet_ID' and the 'UserID' column to 'User_ID'. Explain why renaming columns can be useful.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 5: Duplicate Row Removal**

Check for and remove any duplicate rows in the social media dataset. How many duplicate rows were found and removed?


In [None]:
# Question 5: Duplicate Row Removal
# Check for and remove any duplicate rows in the social media dataset. How many duplicate rows were found and removed?

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

## **Intermediate (RBT Levels: 3, 4):**

Total: 10 Marks

Each Question Carries 2 Marks



**Question 6: Targeted Missing Value Imputation**

Impute the missing values in the 'Sentiment' column with the most frequent value (mode). Explain why you chose this imputation method.

In [None]:
# Question 6: Targeted Missing Value Imputation
# Impute the missing values in the 'Sentiment' column with the most frequent value (mode). Explain why you chose this imputation method.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]



Impute the missing values in the 'Hashtags' column with the string 'No Hashtags'. Explain why you chose this imputation method.


In [None]:
# Impute the missing values in the 'Hashtags' column with the string 'No Hashtags'. Explain why you chose this imputation method.

# Your Code Here:
# ... your code here ...



**Explanation**

[Your explanation here]

**Question 7: Binning Numerical Data and Visualization**

Create a new categorical column called 'AgeGroup' by binning the 'UserAge' column into appropriate age ranges (e.g., 18-25, 26-35, 36-45, 46+). Explain your binning strategy. Create a bar chart showing the distribution of users in each age group.

In [None]:
# Question 7: Binning Numerical Data and Visualization
# Create a new categorical column called 'AgeGroup' by binning the 'UserAge' column into appropriate age ranges (e.g., 18-25, 26-35, 36-45, 46+). Explain your binning strategy. Create a bar chart showing the distribution of users in each age group.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

Create a new categorical column called 'EngagementCategory' by binning the 'Likes' column into quantiles. Explain your binning strategy. Create a boxplot chart showing the distribution of 'Retweets' based on 'EngagementCategory'.

In [None]:
# Create a new categorical column called 'EngagementCategory' by binning the 'Likes' column into quantiles. Explain your binning strategy. Create a boxplot chart showing the distribution of 'Retweets' based on 'EngagementCategory'.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 8: Outlier Detection and Removal**

Use the IQR method to identify and remove outliers from the 'Likes' and 'Retweets' columns. Explain your outlier detection and removal process.
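
As a reminder of the technique, the IQR rule commonly flags values that fall outside the range from Q1 - 1.5 x IQR to Q3 + 1.5 x IQR. The sketch below applies it to a toy series, not to the assignment data. The helper name `iqr_bounds` is hypothetical, and the 1.5 multiplier is the conventional choice, assumed here because the question does not fix one.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences for the IQR rule with multiplier k."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])  # 95 sits far from the rest
lower, upper = iqr_bounds(toy)
kept = toy[toy.between(lower, upper)]  # keep values inside the fences (inclusive)
print(lower, upper)
print(kept.tolist())
```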


In [None]:
# Question 8: Outlier Detection and Removal
# Use the IQR method to identify and remove outliers from the 'Likes' and 'Retweets' columns. Explain your outlier detection and removal process.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 9: String Manipulation**

Clean the 'Hashtags' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.

In [None]:
# Question 9: String Manipulation
# Clean the 'Hashtags' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 10: Dummy Variable Creation and Stacked Bar Plot**

Create dummy variables for the 'UserGender' and 'UserLocation' columns. Explain how dummy variables are used in data analysis. Create a stacked bar plot to visualize the distribution of 'UserGender' within each 'UserLocation'.


In [None]:
# Question 10: Dummy Variable Creation and Stacked Bar Plot
# Create dummy variables for the 'UserGender' and 'UserLocation' columns. Explain how dummy variables are used in data analysis. Create a stacked bar plot to visualize the distribution of 'UserGender' within each 'UserLocation'.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

## **Advanced (RBT Levels: 4, 5):**

Total: 15 Marks

Each Question Carries 3 Marks

**Question 11: Conditional Missing Value Imputation**

Impute missing values in the 'Sentiment' column. If 'Hashtags' is NaN, impute 'Sentiment' with 'Neutral'. Otherwise, impute with the most frequent sentiment among posts with hashtags. Explain your approach.

In [None]:
# Question 11: Conditional Missing Value Imputation
# Impute missing values in the 'Sentiment' column. If 'Hashtags' is NaN, impute 'Sentiment' with 'Neutral'. Otherwise, impute with the most frequent sentiment among posts with hashtags. Explain your approach.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 12: Custom Binning Function**

Write a custom function to create an 'EngagementLevel' column based on the combined 'Likes' and 'Retweets' scores. Categorize scores below 100 as 'Low', scores between 100 and 300 as 'Medium', and scores above 300 as 'High'. Apply this function to create the new column.

In [None]:
# Question 12: Custom Binning Function
# Write a custom function to create an 'EngagementLevel' column based on the combined 'Likes' and 'Retweets' scores. Categorize scores below 100 as 'Low', scores between 100 and 300 as 'Medium', and scores above 300 as 'High'. Apply this function to create the new column.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 13: Grouped Transformations and Line Chart**

Calculate the average 'Likes' for each 'UserLocation'. Then create a new column called 'LikesNormalized' that represents each post's 'Likes' as a z-score relative to its user location's average. Create a line chart visualizing the average normalized Likes across user locations sorted by average normalized Likes.
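
For reference, a within-group z-score is usually defined as follows, where $x_i$ is the 'Likes' value of post $i$, $g(i)$ is that post's 'UserLocation', and $\mu_{g(i)}$ and $\sigma_{g(i)}$ are the mean and standard deviation of 'Likes' within that location. Dividing by the group's standard deviation is the conventional definition and is an assumption here, since the question mentions only the average.

$$
z_i = \frac{x_i - \mu_{g(i)}}{\sigma_{g(i)}}
$$

Under this definition, the mean of $z_i$ within each location is zero by construction, which is worth keeping in mind when interpreting the chart.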


In [None]:
# Question 13: Grouped Transformations and Line Chart
# Calculate the average 'Likes' for each 'UserLocation'. Then create a new column called 'LikesNormalized' that represents each post's 'Likes' as a z-score relative to its user location's average. Create a line chart visualizing the average normalized Likes across user locations sorted by average normalized Likes.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 14: Data Sampling and Validation**

Randomly sample 30% of the dataset. Use this sample to calculate the average 'Retweets' for each 'UserGender'. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.

In [None]:
# Question 14: Data Sampling and Validation
# Randomly sample 30% of the dataset. Use this sample to calculate the average 'Retweets' for each 'UserGender'. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Question 15: Merging Hypothetical Data**

Imagine you have a second dataset with user profile information (e.g., follower count, profile description). Merge this hypothetical dataset with the social media dataset using the 'UserID' column as a key. Explain your merge strategy and how this merged data could be used for further analysis.
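
One possible shape for such a merge is sketched below on toy data. Both frames, the `FollowerCount` column, and the choice of a left join are assumptions for illustration; other join types may suit a different analysis goal.

```python
import pandas as pd

# Toy stand-ins: both frames and the FollowerCount column are hypothetical
posts = pd.DataFrame({"PostID": [1, 2, 3], "UserID": [1001, 1002, 1001]})
profiles = pd.DataFrame({"UserID": [1001, 1003], "FollowerCount": [250, 40]})

# A left join keeps every post; validate raises an error if a UserID
# appears more than once in profiles
merged = posts.merge(profiles, on="UserID", how="left", validate="many_to_one")
print(merged)
```

Posts whose user has no profile row keep their place in the result with NaN in the profile columns, so the merge itself can introduce new missing values.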


In [None]:
# Question 15: Merging Hypothetical Data
# Imagine you have a second dataset with user profile information (e.g., follower count, profile description). Merge this hypothetical dataset with the social media dataset using the 'UserID' column as a key. Explain your merge strategy and how this merged data could be used for further analysis.

# Your Code Here:
# ... your code here ...


**Explanation**

[Your explanation here]

**Report**

**Part 1**

- In this section, compile the explanation of each of the questions.

**Part 2**
- Answer the following data analysis questions:
  1. What are the key characteristics of the user base and their posts in this dataset?
  2. Which factors appear to have the strongest influence on post engagement (likes, retweets)?
  3. What are the most common missing data patterns, and what implications might they have on our analysis?
  4. Based on your analysis, what are 2-3 recommendations you would make to improve post engagement or user reach?

## **Answers**