**Assignment Submission Guidelines**

**1. Submission Platform:**

- Submit your completed assignment through Google Classroom.

**2. Submission Format:**

- Submit the Google Colab Notebook (.ipynb file) provided as the assignment template.
- Do not create a new notebook. Fill in the provided template.

**3. Template Completion:**

The template notebook contains:
- The code to generate the World Values Survey CSV dataset.
- Placeholders for your code and explanations for each question.

Follow the instructions within the template.
- Code Cells:
  - Place your code solutions directly in the designated code cells below each question.
- Markdown Cells:
  - Provide your explanations and justifications in the designated Markdown cells.
- Report section:
  - Complete the Markdown section at the bottom of the notebook titled "Report".
  - In this section, compile your explanations for each question.
  - Answer the following data analysis questions:
    1. What are the key characteristics of global values and beliefs based on this dataset?
    2. Which factors appear to have the strongest influence on life satisfaction?
    3. What are the most common missing data patterns, and what implications might they have?
    4. Based on your analysis, what are 2-3 recommendations you would make to improve social well-being?
- Do not modify the structure of the template notebook.

**4. File Naming:**

Ensure the file name remains as provided in the template. Do not rename the file.

**5. Timely Submission:**

- Submit your completed template notebook by the deadline: **24th of March, 2025**.
- Late submissions will be penalized as follows:
  - Submissions received by **5:00pm on the 26th of March, 2025** will receive a maximum of 5 marks for timely submission.
  - Submissions received after that time will receive 0 marks for timely submission.

**6. Report:**

- Complete the "Report" section at the end of your notebook.
- Ensure your report is:
  - Well-organized and easy to read.
  - Clear and concise.
  - Free of grammatical errors.

**7. Code Execution:**

- Ensure your completed notebook runs without errors from top to bottom.
- Before submitting, restart the kernel and run all cells to confirm reproducibility.



**8. Academic Integrity:**

- All work must be your own.
- Plagiarism will result in a failing grade.
- Cite any external resources you use.



**Tips for Success:**

- Start the assignment early.
- Read the instructions within the template carefully.
- Plan your approach before coding.
- Test your code thoroughly.
- Document your work clearly.
- Review the rubrics to understand the grading criteria.


**Grading Rubrics:**

Total 50 Marks

- Timely Submission: 10 Marks
- Report: 10 Marks
- Level 1 (Basic Questions): 5 Marks (1 x 5 = 5)
- Level 2 (Intermediate Questions): 10 Marks (2 x 5 = 10)
- Level 3 (Advanced Questions): 15 Marks (3 x 5 = 15)

## **Assignment**

**Background**

You are a social scientist working for an international research institute. Your team is tasked with analyzing the World Values Survey data to understand how values and beliefs vary across different countries and regions. Your goal is to identify key trends and patterns that can inform social and political policies.

In [None]:
import pandas as pd
import numpy as np
import random

def generate_wvs_data(num_respondents=1000, num_countries=30, num_waves=3):
    """Generates synthetic World Values Survey data with categorical and numerical columns."""

    countries = [
        "United States", "China", "India", "Brazil", "Russia", "Japan", "Germany", "United Kingdom",
        "France", "Italy", "Canada", "Australia", "Mexico", "Indonesia", "Nigeria", "Pakistan",
        "Egypt", "South Africa", "Turkey", "Saudi Arabia", "Argentina", "Spain", "Poland", "Netherlands",
        "Sweden", "Norway", "Denmark", "Finland", "Ireland", "Portugal"
    ]

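    # Honor num_countries by sampling only from the first num_countries entries.
    countries = countries[:num_countries]
    waves = list(range(1, num_waves + 1))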

    data = []
    for _ in range(num_respondents):
        country = random.choice(countries)
        wave = random.choice(waves)

        # Generate categorical values
        trust_family = random.choice(["Very much", "Quite a lot", "Not very much", "Not at all"])
        importance_religion = random.choice(["Very important", "Quite important", "Not very important", "Not at all important"])
        political_interest = random.choice(["Very interested", "Somewhat interested", "Not very interested", "Not at all interested"])
        life_satisfaction = random.choice(["Very satisfied", "Satisfied", "Dissatisfied", "Very dissatisfied"])
        importance_friends = random.choice(["Very important", "Quite important", "Not very much", "Not at all important"])
        attitude_homosexuality = random.choice(["Strongly disagree", "Disagree", "Agree", "Strongly agree"])
        confidence_government = random.choice(["A great deal", "Quite a lot", "Not very much", "Not at all"])
        env_vs_econ = random.choice(["Economic growth is more important", "Environmental protection is more important", "Both are equally important", "Neither is important"])
        gender = random.choice(["Male", "Female", "Other"])
        education = random.choice(["Primary", "Secondary", "Tertiary", "Postgraduate"])
        income_level = random.choice(["Low", "Medium", "High"])
        internet_access = random.choice(["Yes", "No"])
        trust_neighbors = random.choice(["Very much", "Quite a lot", "Not very much", "Not at all"])
        importance_leisure = random.choice(["Very important", "Quite important", "Not very much", "Not at all important"])

        # Generate numerical values
        age = random.randint(18, 85)
        number_children = random.randint(0, 5)
        years_education = random.randint(6, 20)
        household_income = round(np.random.normal(50000, 20000), 2)

        data.append({
            "country": country,
            "wave": wave,
            "trust_family": trust_family,
            "importance_religion": importance_religion,
            "political_interest": political_interest,
            "life_satisfaction": life_satisfaction,
            "importance_friends": importance_friends,
            "attitude_homosexuality": attitude_homosexuality,
            "confidence_government": confidence_government,
            "env_vs_econ": env_vs_econ,
            "gender": gender,
            "education": education,
            "income_level": income_level,
            "internet_access": internet_access,
            "trust_neighbors": trust_neighbors,
            "importance_leisure": importance_leisure,
            "age": age,
            "number_children": number_children,
            "years_education": years_education,
            "household_income": household_income
        })

    df = pd.DataFrame(data)
    return df

# Generate and save the dataset
wvs_df = generate_wvs_data()
wvs_df.to_csv("wvs_synthetic_data.csv", index=False)
print("Synthetic World Values Survey dataset generated: wvs_synthetic_data.csv")

Synthetic World Values Survey dataset generated: wvs_synthetic_data.csv


**The Data**

Here's the data description for the synthetic World Values Survey dataset:

- country: The country where the survey was conducted.
- wave: The wave of the survey.
- trust_family: Respondent's level of trust in their family.
  - Very much
  - Quite a lot
  - Not very much
  - Not at all
- importance_religion: Respondent's perceived importance of religion.
  - Very important
  - Quite important
  - Not very important
  - Not at all important
- political_interest: Respondent's level of political interest.
  - Very interested
  - Somewhat interested
  - Not very interested
  - Not at all interested
- life_satisfaction: Respondent's level of life satisfaction.
  - Very satisfied
  - Satisfied
  - Dissatisfied
  - Very dissatisfied
- importance_friends: Respondent's perceived importance of friends.
  - Very important
  - Quite important
  - Not very much
  - Not at all important
- attitude_homosexuality: Respondent's attitude towards homosexuality.
  - Strongly disagree
  - Disagree
  - Agree
  - Strongly agree
- confidence_government: Respondent's confidence in the government.
  - A great deal
  - Quite a lot
  - Not very much
  - Not at all
- env_vs_econ: Respondent's view on environmental protection vs. economic growth.
  - Economic growth is more important
  - Environmental protection is more important
  - Both are equally important
  - Neither is important
- gender: Respondent's gender.
  - Male
  - Female
  - Other
- education: Respondent's highest level of education.
  - Primary
  - Secondary
  - Tertiary
  - Postgraduate
- income_level: Respondent's income level.
  - Low
  - Medium
  - High
- internet_access: Respondent's access to the internet.
  - Yes
  - No
- trust_neighbors: Respondent's trust in their neighbors.
  - Very much
  - Quite a lot
  - Not very much
  - Not at all
- importance_leisure: Respondent's perceived importance of leisure time.
  - Very important
  - Quite important
  - Not very much
  - Not at all important
- age: Respondent's age (numerical).
- number_children: Number of children the respondent has (numerical).
- years_education: Years of education the respondent has received (numerical).
- household_income: Household income (numerical).
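Several of the categorical columns above are ordinal. As a minimal sketch, one way to encode such a column is as an ordered categorical. The least-to-most ordering in `levels` and the toy `responses` series are assumptions made for illustration, not part of the template.

```python
import pandas as pd

# Toy responses; the real column would come from wvs_synthetic_data.csv.
responses = pd.Series(
    ["Satisfied", "Very dissatisfied", "Very satisfied", "Dissatisfied"]
)

# Assumed ordering, from least to most satisfied.
levels = ["Very dissatisfied", "Dissatisfied", "Satisfied", "Very satisfied"]
satisfaction = pd.Categorical(responses, categories=levels, ordered=True)

# Integer codes follow the declared order (0 is the lowest level).
codes = pd.Series(satisfaction.codes, index=responses.index)
print(codes.tolist())
```

An ordered encoding like this makes comparisons and averages of response levels possible, at the cost of treating the gaps between levels as equal.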

## **Basic (RBT Levels: 2, 3):**

Total: 5 Marks

Each question carries 1 mark.

**Question 1: Missing Value Identification**

Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
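As a hint, here is a minimal sketch of counting missing values per column on a toy DataFrame. The `toy` frame and its values are made up for illustration and are not the assignment data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; not the assignment data.
toy = pd.DataFrame({
    "country": ["India", "Brazil", None, "Japan"],
    "age": [34, np.nan, 51, np.nan],
    "wave": [1, 2, 3, 1],
})

# isna() marks missing cells; sum() counts them per column.
missing_counts = toy.isna().sum()

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```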

In [None]:
# Question 1: Missing Value Identification
# Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 2: Basic Missing Value Handling**

Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.
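As a hint, listwise deletion and the count of removed rows can be sketched on a toy DataFrame as follows. The `toy` frame is hypothetical and stands in for the assignment data.

```python
import numpy as np
import pandas as pd

# Toy frame; two of its five rows contain a missing value.
toy = pd.DataFrame({
    "country": ["India", "Brazil", None, "Japan", "Egypt"],
    "age": [34, np.nan, 51, 27, 62],
})

rows_before = len(toy)
# dropna() with default arguments drops every row with at least one missing value.
complete_rows = toy.dropna()
rows_removed = rows_before - len(complete_rows)
print(f"Removed {rows_removed} of {rows_before} rows")
```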

In [None]:
# Question 2: Basic Missing Value Handling
# Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 3: Data Type Validation and Conversion**

Verify the data types of each column. Convert the 'wave', 'age', 'number_children' and 'years_education' columns to integer data types and the 'household_income' column to a float datatype. Explain why these data types are appropriate.
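As a hint, a minimal sketch of checking and converting data types on a toy DataFrame. The values are made up, and the sketch assumes the columns no longer contain missing values, since a NumPy integer column cannot hold NaN.

```python
import pandas as pd

# Toy frame whose columns were read with less suitable types.
toy = pd.DataFrame({
    "wave": [1.0, 2.0, 3.0],
    "age": [34.0, 51.0, 27.0],
    "household_income": [52000, 48000, 61000],
})
print(toy.dtypes)  # inspect the current types first

# astype accepts a column-to-dtype mapping and returns a new frame.
converted = toy.astype({
    "wave": "int64",
    "age": "int64",
    "household_income": "float64",
})
print(converted.dtypes)
```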


In [None]:
# Question 3: Data Type Conversion
# Verify the data types of each column. Convert the 'wave', 'age', 'number_children' and 'years_education' columns to integer data types and the 'household_income' column to a float datatype.
# Explain why these data types are appropriate.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 4: Renaming Columns**

Rename columns like 'trust_family' to 'Trust_Family' and 'life_satisfaction' to 'Life_Satisfaction'. Explain why renaming columns can be useful.
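As a hint, renaming with an explicit mapping might look like the sketch below. The one-row `toy` frame is hypothetical.

```python
import pandas as pd

# One-row toy frame; only the column names matter here.
toy = pd.DataFrame({
    "trust_family": ["Very much"],
    "life_satisfaction": ["Satisfied"],
    "age": [34],
})

# rename() returns a new frame; columns not in the mapping are left unchanged.
renamed = toy.rename(columns={
    "trust_family": "Trust_Family",
    "life_satisfaction": "Life_Satisfaction",
})
print(list(renamed.columns))
```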


In [None]:
# Question 4: Renaming Columns
# Rename columns like 'trust_family' to 'Trust_Family' and 'life_satisfaction' to 'Life_Satisfaction'.
# Explain why renaming columns can be useful.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 5: Duplicate Row Removal**

Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?
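As a hint, counting and dropping exact duplicate rows can be sketched as follows on a toy DataFrame with made-up values.

```python
import pandas as pd

# Toy frame in which ("India", 1) appears three times.
toy = pd.DataFrame({
    "country": ["India", "Brazil", "India", "Japan", "India"],
    "wave": [1, 2, 1, 3, 1],
})

# duplicated() flags every row that repeats an earlier row.
duplicate_count = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(duplicate_count, len(deduplicated))
```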


In [None]:
# Question 5: Duplicate Row Removal
# Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?
# Your Code Here:

**Explanation**

[Your explanation here]

## **Intermediate (RBT Levels: 3, 4):**

Total: 10 Marks

Each question carries 2 marks.



**Question 6: Targeted Missing Value Imputation**

Impute missing values in the 'years_education' column with the median. Explain why you chose this imputation method. Impute missing values in the 'household_income' column with the mean. Explain your method.
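As a hint, median and mean imputation can be sketched on a toy DataFrame as shown below. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in both columns.
toy = pd.DataFrame({
    "years_education": [8, 12, np.nan, 16, 20],
    "household_income": [30000.0, np.nan, 50000.0, 70000.0, np.nan],
})

# Both statistics skip missing values by default.
median_years = toy["years_education"].median()
mean_income = toy["household_income"].mean()

# fillna() replaces only the missing entries.
toy["years_education"] = toy["years_education"].fillna(median_years)
toy["household_income"] = toy["household_income"].fillna(mean_income)
print(toy)
```

The median is less sensitive to extreme values than the mean, which is one common reason to prefer it for skewed columns.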

In [None]:
# Question 6: Targeted Missing Value Imputation
# Impute missing values in the 'years_education' column with the median.
# Explain why you chose this imputation method.
# Impute missing values in the 'household_income' column with the mean.
# Explain your method.
# Your Code Here:

**Explanation**

[Your explanation here]

**Explanation**

[Your explanation here]

**Question 7: Binning Numerical Data and Visualization**

Create a new categorical column called 'Age_Category' by binning the 'age' column into appropriate categories (e.g., young, middle-aged, elderly). Explain your binning strategy. Create a bar chart showing the distribution of respondents in each category.
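As a hint, binning with `pd.cut` can be sketched as follows. The bin edges and labels are hypothetical choices, not a prescribed answer, and the toy ages are made up.

```python
import pandas as pd

toy = pd.DataFrame({"age": [18, 25, 35, 36, 60, 61, 85]})

# Hypothetical edges: 18-35 young, 36-60 middle-aged, 61-85 elderly.
# pd.cut uses right-closed intervals by default: (17, 35], (35, 60], (60, 85].
bins = [17, 35, 60, 85]
labels = ["Young", "Middle-aged", "Elderly"]
toy["Age_Category"] = pd.cut(toy["age"], bins=bins, labels=labels)

# Counts per category are the values a bar chart would display.
category_counts = toy["Age_Category"].value_counts()
print(category_counts)
```

With matplotlib installed, `category_counts.plot(kind="bar")` would draw these counts as a bar chart.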

In [None]:
# Question 7: Binning Numerical Data and Visualization
# Create a new categorical column called 'Age_Category' by binning the 'age' column into appropriate categories (e.g., young, middle-aged, elderly).
# Explain your binning strategy.
# Create a bar chart showing the distribution of respondents in each category.
# Your Code Here:

**Explanation**

[Your explanation here]

Create a box plot to show the distribution of 'household_income' based on different 'education' categories.

In [None]:
# Create a box plot to show the distribution of 'household_income' based on different 'education' categories.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 8: Outlier Detection and Removal**

Use the IQR method to identify and remove outliers from the 'years_education' and 'household_income' columns. Explain your outlier detection and removal process.
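As a hint, the IQR rule can be sketched as a small helper. The name `remove_iqr_outliers`, the conventional multiplier `k=1.5`, and the toy values are assumptions made for illustration.

```python
import pandas as pd

def remove_iqr_outliers(frame, column, k=1.5):
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # between() is inclusive; rows with missing values are also dropped.
    return frame[frame[column].between(lower, upper)]

# Toy frame with one extreme income.
toy = pd.DataFrame(
    {"household_income": [40000, 45000, 50000, 55000, 60000, 250000]}
)
filtered = remove_iqr_outliers(toy, "household_income")
print(len(toy) - len(filtered), "row(s) removed")
```

The helper can be applied to each column in turn. Note that the fences of the second column are then computed on the rows that survived the first pass.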


In [None]:
# Question 8: Outlier Detection and Removal
# Use the IQR method to identify and remove outliers from the 'years_education' and 'household_income' columns.
# Explain your outlier detection and removal process.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 9: String Manipulation**

Clean the 'country' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.
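As a hint, vectorized string cleaning with the `.str` accessor can be sketched as follows. The toy values are made up.

```python
import pandas as pd

# Toy frame with inconsistent spacing and casing.
toy = pd.DataFrame(
    {"country": ["  India", "Brazil  ", " United States ", "JAPAN"]}
)

# strip() removes leading and trailing whitespace; lower() normalizes case.
toy["country"] = toy["country"].str.strip().str.lower()
print(toy["country"].tolist())
```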


In [None]:
# Question 9: String Manipulation
# Clean the 'country' column by removing any leading or trailing whitespace.
# Convert all values to lowercase to ensure consistency.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 10: Dummy Variable Creation and Stacked Bar Plot**

Create dummy variables for the 'gender' column. Create a stacked bar plot to visualize the distribution of 'Life_Satisfaction' based on 'gender'.
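As a hint, dummy variables and the count table behind a stacked bar plot can be sketched as follows. The toy frame is hypothetical, and the `gender_` prefix is an arbitrary choice.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female", "Male", "Female"],
    "Life_Satisfaction": [
        "Satisfied", "Very satisfied", "Dissatisfied",
        "Satisfied", "Satisfied", "Dissatisfied",
    ],
})

# One indicator column per gender value.
gender_dummies = pd.get_dummies(toy["gender"], prefix="gender")
toy_with_dummies = pd.concat([toy, gender_dummies], axis=1)

# Rows are genders, columns are satisfaction levels, cells are counts.
satisfaction_by_gender = pd.crosstab(toy["gender"], toy["Life_Satisfaction"])
print(satisfaction_by_gender)
```

With matplotlib installed, `satisfaction_by_gender.plot(kind="bar", stacked=True)` would draw one stacked bar per gender.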

In [None]:
# Question 10: Dummy Variable Creation and Stacked Bar Plot
# Create dummy variables for the 'gender' column.
# Create a stacked bar plot to visualize the distribution of 'Life_Satisfaction' based on 'gender'.
# Your Code Here:

**Explanation**

[Your explanation here]

## **Advanced (RBT Levels: 4, 5):**

Total: 15 Marks

Each question carries 3 marks.

**Question 11: Conditional Missing Value Imputation**

Impute missing values in 'household_income' based on 'education'. If education is 'Primary', impute with the global low average; if education is 'Postgraduate', impute with the global high average.
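As a hint, conditional imputation with boolean masks can be sketched as follows. The assignment does not define the global low and high averages, so this sketch assumes they are the mean 'household_income' of respondents whose 'income_level' is "Low" and "High" respectively. That reading, and the toy values, are assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "education": ["Primary", "Secondary", "Postgraduate", "Tertiary",
                  "Primary", "Postgraduate", "Secondary"],
    "income_level": ["Low", "Low", "High", "High", "Low", "High", "Medium"],
    "household_income": [20000.0, 30000.0, 90000.0, 70000.0,
                         np.nan, np.nan, np.nan],
})

# Assumed definitions of the two averages (see the note above).
low_average = toy.loc[toy["income_level"] == "Low", "household_income"].mean()
high_average = toy.loc[toy["income_level"] == "High", "household_income"].mean()

# Fill only the missing incomes that match each education condition.
missing = toy["household_income"].isna()
toy.loc[missing & (toy["education"] == "Primary"), "household_income"] = low_average
toy.loc[missing & (toy["education"] == "Postgraduate"), "household_income"] = high_average
print(toy)
```

Rows with other education levels, such as the last row here, stay missing under this rule and would need a separate strategy.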

In [None]:
# Question 11: Conditional Missing Value Imputation
# Impute missing values in 'household_income' based on 'education'.
# If education is 'Primary', impute with the global low average; if education is 'Postgraduate', impute with the global high average.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 12: Custom Binning Function**

Write a custom function to create a 'Religion_Importance_Category' column based on the 'importance_religion' column. Categorize values into 'Low', 'Medium', and 'High'. Apply this function to create the new column.
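As a hint, a custom categorization function applied with `Series.apply` can be sketched as follows. The function name and the way four response levels collapse into three categories are assumptions, and other groupings are equally defensible.

```python
import pandas as pd

def religion_importance_category(response):
    """Map a survey response to 'Low', 'Medium', or 'High' (assumed grouping)."""
    mapping = {
        "Very important": "High",
        "Quite important": "Medium",
        "Not very important": "Low",
        "Not at all important": "Low",
    }
    # Unrecognized or missing responses map to None.
    return mapping.get(response)

toy = pd.DataFrame({
    "importance_religion": [
        "Very important", "Quite important",
        "Not very important", "Not at all important",
    ]
})
toy["Religion_Importance_Category"] = toy["importance_religion"].apply(
    religion_importance_category
)
print(toy)
```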


In [None]:
# Question 12: Custom Binning Function
# Write a custom function to create a 'Religion_Importance_Category' column based on the 'importance_religion' column.
# Categorize values into 'Low', 'Medium', and 'High'.
# Apply this function to create the new column.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 13: Grouped Transformations and Line Chart**

Calculate the average 'household_income' for each 'wave'. Then create a new column called 'household_income_Normalized' that represents each respondent's 'household_income' as a z-score relative to the wave's average. Create a line chart visualizing the average normalized household income across waves.
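As a hint, a within-group z-score can be sketched with `groupby(...).transform`. The toy values are made up, and the sketch uses the sample standard deviation, which is the pandas default.

```python
import pandas as pd

toy = pd.DataFrame({
    "wave": [1, 1, 1, 2, 2, 2],
    "household_income": [40000.0, 50000.0, 60000.0,
                         30000.0, 50000.0, 70000.0],
})

# Average income per wave (one value per wave).
average_by_wave = toy.groupby("wave")["household_income"].mean()

# transform() broadcasts each wave's statistic back onto its rows.
wave_mean = toy.groupby("wave")["household_income"].transform("mean")
wave_std = toy.groupby("wave")["household_income"].transform("std")
toy["household_income_Normalized"] = (toy["household_income"] - wave_mean) / wave_std

# Per-wave average of the normalized column, as used for the line chart.
normalized_by_wave = toy.groupby("wave")["household_income_Normalized"].mean()
print(normalized_by_wave)
```

The per-wave mean of a within-wave z-score is zero by construction (up to rounding), which is worth keeping in mind when interpreting the line chart.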


In [None]:
# Question 13: Grouped Transformations and Line Chart
# Calculate the average 'household_income' for each 'wave'.
# Then create a new column called 'household_income_Normalized' that represents each respondent's 'household_income' as a z-score relative to the wave's average.
# Create a line chart visualizing the average normalized household income across waves.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 14: Data Sampling and Validation**

Randomly sample 20% of the dataset. Use this sample to calculate the mean 'attitude_homosexuality' for each 'country'. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.
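As a hint, comparing sample and full-data group means can be sketched as follows. Because 'attitude_homosexuality' is categorical, the sketch assumes an ordinal 1 to 4 encoding (`scale`). That encoding, the fixed `random_state`, and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Assumed ordinal encoding of the four response options.
scale = {"Strongly disagree": 1, "Disagree": 2, "Agree": 3, "Strongly agree": 4}

toy = pd.DataFrame({
    "country": ["India", "Brazil"] * 10,
    "attitude_homosexuality": [
        "Agree", "Disagree", "Strongly agree", "Strongly disagree", "Agree",
    ] * 4,
})
toy["attitude_score"] = toy["attitude_homosexuality"].map(scale)

# Means from the full data and from a reproducible 20% sample.
full_means = toy.groupby("country")["attitude_score"].mean()
sample = toy.sample(frac=0.2, random_state=42)
sample_means = sample.groupby("country")["attitude_score"].mean()

# Align on country; a country absent from the sample shows up as NaN.
comparison = pd.DataFrame({"full": full_means, "sample": sample_means})
comparison["difference"] = comparison["sample"] - comparison["full"]
print(comparison)
```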

In [None]:
# Question 14: Data Sampling and Validation
# Randomly sample 20% of the dataset. Use this sample to calculate the mean 'attitude_homosexuality' for each 'country'.
# Compare these means to the means calculated using the entire dataset.
# Discuss any differences and their potential implications.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 15: Merging Hypothetical Data**

Imagine you have a second dataset with economic indicators (e.g., GDP per capita) for each country and wave. Merge this hypothetical dataset with the WVS dataset using the 'country' and 'wave' columns as keys. Explain your merge strategy and how this merged data could be used for further analysis.
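As a hint, a left merge on two keys can be sketched as follows. Both frames are hypothetical, and the `gdp_per_capita` figures are made-up placeholders, not real statistics.

```python
import pandas as pd

# Toy survey rows (many respondents can share a country and wave).
wvs = pd.DataFrame({
    "country": ["India", "India", "Brazil", "Japan"],
    "wave": [1, 2, 1, 3],
    "age": [34, 51, 27, 62],
})

# Hypothetical indicators: one row per country and wave, placeholder values.
econ = pd.DataFrame({
    "country": ["India", "India", "Brazil"],
    "wave": [1, 2, 1],
    "gdp_per_capita": [1000.0, 1500.0, 8000.0],
})

# A left merge keeps every respondent; validate checks the key cardinality.
merged = wvs.merge(
    econ,
    on=["country", "wave"],
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Respondents without a matching indicator row.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

A left merge is one reasonable choice because it never drops survey respondents. An inner merge would instead keep only the rows that have indicators.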

In [None]:
# Question 15: Merging Hypothetical Data
# Imagine you have a second dataset with economic indicators (e.g., GDP per capita) for each country and wave.
# Merge this hypothetical dataset with the WVS dataset using the 'country' and 'wave' columns as keys.
# Explain your merge strategy and how this merged data could be used for further analysis.
# Your Code Here:

**Explanation**

[Your explanation here]

**Report**

**Part 1**

- In this section, compile your explanations for each question.

**Part 2**

- Answer the following data analysis questions:
  1. What are the key characteristics of global values and beliefs based on this dataset?
  2. Which factors appear to have the strongest influence on life satisfaction?
  3. What are the most common missing data patterns, and what implications might they have?
  4. Based on your analysis, what are 2-3 recommendations you would make to improve social well-being?