**Assignment Submission Guidelines**

**1. Submission Platform:**

- Submit your completed assignment through Google Classroom.

**2. Submission Format:**

- Submit the Google Colab Notebook (.ipynb file) provided as the assignment template.
- Do not create a new notebook. Fill in the provided template.

**3. Template Completion:**

The template notebook contains:
- The code to generate the Global Food Security Analysis CSV dataset.
- Placeholders for your code and explanations for each question.

Follow the instructions within the template.
- Code Cells:
  - Place your code solutions directly in the designated code cells below each question.
- Markdown Cells:
  - Provide your explanations and justifications in the designated Markdown cells.
- Report section:
  - Complete the markdown section at the bottom of the notebook titled "Report".
  - In this section, compile the explanation of each of the questions.
  - Answer the following data analysis questions:
    1. What are the key characteristics of global food security based on this dataset?
    2. Which factors appear to have the strongest influence on food security?
    3. What are the most common missing data patterns, and what implications might they have?
    4. Based on your analysis, what are 2-3 recommendations you would make to improve global food security?

- Do not modify the structure of the template notebook.

**4. File Naming:**

Ensure the file name remains as provided in the template. Do not rename the file.

**5. Timely Submission:**

- Submit your completed template notebook by the deadline: **24th of March, 2025**.
- Late submissions will be penalized as follows:
  - Submissions received by **5:00 pm on 26th of March, 2025** will receive a maximum of 5 marks for timely submission.
  - Submissions after that time will receive 0 marks for timely submission.

**6. Report:**

- Complete the "Report" section at the end of your notebook.
- Ensure your report is:
  - Well-organized and easy to read.
  - Clear and concise.
  - Free of grammatical errors.

**7. Code Execution:**

Ensure your completed notebook runs without errors from top to bottom.
Before submitting, restart the kernel and run all cells to confirm reproducibility.
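
One reason a notebook can give different numbers after a restart is unseeded random number generation. The sketch below is an illustration only and is not part of the template; the seed value `42` and the helper `draw_values` are hypothetical. It shows that fixing the seed for both `random` and `numpy.random` makes two runs produce the same draws.

```python
import random

import numpy as np

SEED = 42  # hypothetical value; any fixed integer behaves the same way


def draw_values(seed):
    """Seed both generators, then draw one value from each."""
    random.seed(seed)
    np.random.seed(seed)
    return random.random(), float(np.random.normal(0, 1))


first_run = draw_values(SEED)
second_run = draw_values(SEED)
print(first_run == second_run)  # identical draws when the seed is fixed
```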



**8. Academic Integrity:**

All work must be your own.
Plagiarism will result in a failing grade.
Cite any external resources you use.



**Tips for Success:**

- Start the assignment early.
- Read the instructions within the template carefully.
- Plan your approach before coding.
- Test your code thoroughly.
- Document your work clearly.
- Review the rubrics to understand the grading criteria.


**Grading Rubrics:**

Total 50 Marks

- Timely Submission: 10 Marks
- Report: 10 Marks
- Level 1 (Basic Questions): 5 Marks (1 x 5 = 5)
- Level 2 (Intermediate Questions): 10 Marks (2 x 5 = 10)
- Level 3 (Advanced Questions): 15 Marks (3 x 5 = 15)

## **Assignment**

**Background**

You are a data analyst working for "Global Food Insights," a consultancy focused on analyzing global food security and providing data-driven recommendations. Your team has been tasked with analyzing a dataset containing various indicators of food security across different countries. This dataset includes information on food availability, access, utilization, and stability.

Your goal is to leverage this data to identify key factors influencing food security and provide actionable recommendations to international organizations and governments.

In [None]:
import pandas as pd
import numpy as np
import random

def generate_food_security_data(num_countries=150, num_years=5):
    """Generates synthetic food security data with specific events."""

    countries = [
        "United States", "China", "India", "Brazil", "Russia", "Japan", "Germany", "United Kingdom",
        "France", "Italy", "Canada", "Australia", "Mexico", "Indonesia", "Nigeria", "Pakistan",
        "Egypt", "South Africa", "Turkey", "Saudi Arabia", "Argentina", "Spain", "Poland", "Netherlands",
        "Belgium", "Switzerland", "Sweden", "Norway", "Denmark", "Finland", "Ireland", "Portugal",
        "Austria", "Greece", "Hungary", "Czech Republic", "Romania", "Ukraine", "Belarus", "Kazakhstan",
        "Uzbekistan", "Vietnam", "Thailand", "Malaysia", "Singapore", "Philippines", "Bangladesh",
        "Nepal", "Sri Lanka", "Cambodia", "Laos", "Myanmar", "New Zealand", "Chile", "Colombia",
        "Peru", "Venezuela", "Ecuador", "Bolivia", "Paraguay", "Uruguay", "Algeria", "Morocco",
        "Tunisia", "Libya", "Sudan", "Kenya", "Ethiopia", "Tanzania", "Uganda", "Ghana", "Ivory Coast",
        "Cameroon", "Senegal", "Zambia", "Zimbabwe", "Angola", "Mozambique", "Botswana", "Namibia",
        "Madagascar", "Iceland", "Greenland", "Cuba", "Dominican Republic", "Haiti", "Jamaica",
        "Panama", "Costa Rica", "Nicaragua", "Honduras", "Guatemala", "El Salvador", "Kuwait",
        "Qatar", "United Arab Emirates", "Oman", "Bahrain", "Jordan", "Lebanon", "Syria", "Iraq",
        "Iran", "Afghanistan", "Mongolia", "North Korea", "South Korea", "Taiwan", "Georgia",
        "Armenia", "Azerbaijan", "Kyrgyzstan", "Tajikistan", "Turkmenistan", "Moldova", "Slovakia",
        "Slovenia", "Croatia", "Bosnia and Herzegovina", "Albania", "North Macedonia", "Serbia",
        "Montenegro", "Estonia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Cyprus"
    ]

    data = []
    for country in countries[:num_countries]:
        for year in range(2018, 2018 + num_years):
            # Generate random but somewhat realistic data
            prevalence_undernourishment = max(0, np.random.normal(10, 5))
            food_supply_quantity = max(100, np.random.normal(2500, 500))
            food_production_index = max(50, np.random.normal(110, 15))
            gdp_per_capita = max(500, np.random.normal(20000, 10000))
            political_stability = np.random.normal(0, 1)  # Nominal scale -2.5 to 2.5 (values are not clipped)
            import_dependency = max(0, np.random.normal(30, 15))
            agricultural_land = max(5, np.random.normal(40, 20))
            average_rainfall = max(100, np.random.normal(1000, 300))
            average_temperature = max(-10, np.random.normal(20, 10))

            # Generate specific events
            natural_disaster_events = []
            conflict_events = []

            if random.random() < 0.2:  # 20% chance of a natural disaster
                disaster_types = ["Flood", "Drought", "Earthquake", "Hurricane", "Wildfire"]
                natural_disaster_events.append(random.choice(disaster_types))
            if random.random() < 0.1:  # 10% chance of a conflict event
                conflict_types = ["Civil Unrest", "Localized Conflict", "Border Dispute"]
                conflict_events.append(random.choice(conflict_types))

            data.append({
                "Country": country,
                "Year": year,
                "Prevalence of Undernourishment (%)": prevalence_undernourishment,
                "Food Supply Quantity (kg/capita/year)": food_supply_quantity,
                "Food Production Index (2014-2016=100)": food_production_index,
                "GDP per Capita (USD)": gdp_per_capita,
                "Political Stability and Absence of Violence/Terrorism": political_stability,
                "Import Dependency Ratio (%)": import_dependency,
                "Agricultural Land (% of land area)": agricultural_land,
                "Average Rainfall (mm)": average_rainfall,
                "Average Temperature (Celsius)": average_temperature,
                "Natural Disaster Events": ", ".join(natural_disaster_events) if natural_disaster_events else None,
                "Conflict Events": ", ".join(conflict_events) if conflict_events else None,
            })

    df = pd.DataFrame(data)
    return df

# Generate and save the dataset
food_security_df = generate_food_security_data()
food_security_df.to_csv("food_security_data.csv", index=False)
print("Synthetic food security dataset generated: food_security_data.csv")

Synthetic food security dataset generated: food_security_data.csv


**The Data**

- **Country**: The name of the country.
- **Year**: The year for which the data is recorded.
- **Prevalence of Undernourishment (%)**: The percentage of the population suffering from undernourishment.
- **Food Supply Quantity (kg/capita/year)**: The amount of food supply available per person per year, measured in kilograms.
- **Food Production Index (2014-2016=100)**: An index representing the level of food production, with the base period of 2014-2016 set to 100.
- **GDP per Capita (USD)**: The Gross Domestic Product per capita, measured in US dollars.
- **Political Stability and Absence of Violence/Terrorism**: An index measuring the level of political stability and the absence of violence and terrorism. Higher values indicate greater stability.
- **Import Dependency Ratio (%)**: The percentage of food that is imported, indicating the country's reliance on food imports.
- **Agricultural Land (% of land area)**: The percentage of the country's total land area that is used for agricultural purposes.
- **Average Rainfall (mm)**: The average annual rainfall in millimeters.
- **Average Temperature (Celsius)**: The average annual temperature in degrees Celsius.
- **Natural Disaster Events**: A comma-separated string listing specific natural disaster events that occurred in the country during the given year (e.g., "Flood, Drought"). If no natural disasters occurred, this column contains None.
- **Conflict Events**: A comma-separated string listing specific conflict events that occurred in the country during the given year (e.g., "Civil Unrest"). If no conflict events occurred, this column contains None.
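
When a column holding `None` is written to CSV and read back, pandas represents the empty cells as `NaN`. The following minimal sketch uses a toy frame and an in-memory buffer (both assumptions, standing in for the real file) to show the round trip.

```python
import io

import pandas as pd

# Toy frame, not the assignment dataset.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "Year": [2018, 2018],
    "Natural Disaster Events": ["Flood", None],
})

buffer = io.StringIO()          # in-memory stand-in for a CSV file on disk
toy.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# The None written above comes back as NaN, which isna() detects.
print(reloaded["Natural Disaster Events"].isna().tolist())
```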


## **Basic (RBT Levels: 2, 3):**

Total: 5 Marks

Each question carries 1 mark.

**Question 1: Missing Value Identification**

Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
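
As a general illustration of counting missing values per column, here is a minimal sketch on a toy frame. The column names `score`, `event`, and `year` are hypothetical and unrelated to the assignment data.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical columns.
toy = pd.DataFrame({
    "score": [1.0, np.nan, 3.0, np.nan],
    "event": ["Flood", None, None, None],
    "year": [2018, 2019, 2020, 2021],
})

missing_per_column = toy.isna().sum()  # number of missing cells in each column
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```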

In [None]:
# Question 1: Missing Value Identification
# Identify the columns in the dataset that contain missing values. How many missing values are present in each column?
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 2: Basic Missing Value Handling**

Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.


In [None]:
# Question 2: Basic Missing Value Handling
# Remove all rows that contain at least one missing value. How many rows are removed? Explain why you chose this approach.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 3: Data Type Conversion**

Verify the data types of each column. Convert the 'Year' column to an integer data type and the 'Prevalence of Undernourishment (%)' column to a float data type. Explain why these data types are appropriate.


In [None]:
# Question 3: Data Type Conversion
# Verify the data types of each column. Convert the 'Year' column to an integer data type and the 'Prevalence of Undernourishment (%)' column to a float data type.
# Explain why these data types are appropriate.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 4: Renaming Columns**

Rename the 'Prevalence of Undernourishment (%)' column to 'Undernourishment_Percentage' and the 'GDP per Capita (USD)' column to 'GDP_Per_Capita'. Explain why renaming columns can be useful.


In [None]:
# Question 4: Renaming Columns
# Rename the 'Prevalence of Undernourishment (%)' column to 'Undernourishment_Percentage' and the 'GDP per Capita (USD)' column to 'GDP_Per_Capita'.
# Explain why renaming columns can be useful.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 5: Duplicate Row Removal**

Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?




In [None]:
# Question 5: Duplicate Row Removal
# Check for and remove any duplicate rows in the dataset. How many duplicate rows were found and removed?
# Your Code Here:

**Explanation**

[Your explanation here]

## **Intermediate (RBT Levels: 3, 4):**

Total: 10 Marks

Each question carries 2 marks.



**Question 6: Targeted Missing Value Imputation**

Impute the missing values in the 'Political Stability and Absence of Violence/Terrorism' column with the mean. Explain why you chose this imputation method.




In [None]:
# Question 6: Targeted Missing Value Imputation
# Impute the missing values in the 'Political Stability and Absence of Violence/Terrorism' column with the mean.
# Explain why you chose this imputation method.
# Your Code Here:

**Explanation**

[Your explanation here]

Impute the missing values in the 'Import Dependency Ratio (%)' and 'Agricultural Land (% of land area)' columns with the median. Explain why you chose this imputation method.
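
The choice between mean and median matters most when a column is skewed. The sketch below uses a small hypothetical series with one large value to show how differently the two fill values behave; it is not a statement about the assignment columns.

```python
import numpy as np
import pandas as pd

# Hypothetical series: 100.0 pulls the mean upward, the last entry is missing.
values = pd.Series([1.0, 2.0, 3.0, 100.0, np.nan])

mean_filled = values.fillna(values.mean())      # fill value is (1+2+3+100)/4
median_filled = values.fillna(values.median())  # fill value is the middle of 2 and 3

print(mean_filled.iloc[-1], median_filled.iloc[-1])
```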

In [None]:
# Impute the missing values in the 'Import Dependency Ratio (%)' and 'Agricultural Land (% of land area)' columns with the median.
# Explain why you chose this imputation method.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 7: Binning Numerical Data and Visualization**

Create a new categorical column called 'GDP_Category' by binning the 'GDP_Per_Capita' column into quantiles. Explain your binning strategy. Create a box plot showing the distribution of 'Undernourishment_Percentage' based on 'GDP_Category'.
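
Quantile binning places roughly the same number of rows in each bin. A minimal sketch with `pd.qcut` follows; the values, the choice of four bins, and the labels `Q1` to `Q4` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical income-like values.
income = pd.Series([500, 1500, 3000, 8000, 12000, 20000, 35000, 60000])

# Four quantile-based bins; each receives two of the eight values.
income_band = pd.qcut(income, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(income_band.tolist())
```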

In [None]:
# Question 7: Binning Numerical Data and Visualization
# Create a new categorical column called 'GDP_Category' by binning the 'GDP_Per_Capita' column into quantiles.
# Explain your binning strategy. Create a box plot showing the distribution of 'Undernourishment_Percentage' based on 'GDP_Category'.
# Your Code Here:

**Explanation**

[Your explanation here]

Create a new categorical column called 'RainfallCategory' by binning the 'Average Rainfall (mm)' column into 3 equal-sized bins. Create a bar chart showing the distribution of countries in each 'RainfallCategory'.
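
One reading of three equally sized bins is three intervals of equal width, which `pd.cut` produces. The sketch below assumes that reading and uses hypothetical values and labels. If equal counts per bin were intended instead, `pd.qcut` would be the corresponding tool.

```python
import pandas as pd

# Hypothetical rainfall values in mm; one value is far above the rest.
rain = pd.Series([100, 200, 300, 350, 1000])

# Three equal-width intervals over the observed range of 100 to 1000.
rain_band = pd.cut(rain, bins=3, labels=["Low", "Medium", "High"])
print(rain_band.tolist())
```

With equal-width bins a skewed column can leave some bins nearly empty, as the middle bin is here.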

In [None]:
# Create a new categorical column called 'RainfallCategory' by binning the 'Average Rainfall (mm)' column into 3 equal-sized bins.
# Create a bar chart showing the distribution of countries in each 'RainfallCategory'.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 8: Outlier Detection and Removal**

Use the IQR method to identify and remove outliers from the 'Undernourishment_Percentage' and 'Food Supply Quantity (kg/capita/year)' columns. Explain your outlier detection and removal process.
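
The IQR rule flags values that lie more than 1.5 times the interquartile range below the first quartile or above the third. A minimal sketch on a toy column follows; the column name `value` and the planted outlier are assumptions.

```python
import pandas as pd

# Toy column; 95 is a deliberately planted outlier.
toy = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95]})

q1 = toy["value"].quantile(0.25)
q3 = toy["value"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # the two fences

inside = toy["value"].between(lower, upper)  # True for rows within the fences
cleaned = toy[inside]
print(len(toy) - len(cleaned), "row(s) removed")
```

When the rule is applied to several columns, the order of filtering can change which rows survive, because the quartiles are recomputed on whatever rows remain.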


In [None]:
# Question 8: Outlier Detection and Removal
# Use the IQR method to identify and remove outliers from the 'Undernourishment_Percentage'
# and 'Food Supply Quantity (kg/capita/year)' columns.
# Explain your outlier detection and removal process.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 9: String Manipulation**

Clean the 'Country' column by removing any leading or trailing whitespace. Convert all values to lowercase to ensure consistency.


In [None]:
# Question 9: String Manipulation
# Clean the 'Country' column by removing any leading or trailing whitespace.
# Convert all values to lowercase to ensure consistency.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 10: Dummy Variable Creation and Stacked Bar Plot**

Create dummy variables for the presence of natural disasters (i.e., if the 'Natural Disaster Events' column is not None). Create a stacked bar plot to visualize the distribution of 'GDP_Category' based on the presence of natural disasters.
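
A presence indicator can be derived from whether a cell is missing, and a cross-tabulation of that indicator against a category gives the counts a stacked bar plot would display. The sketch below uses hypothetical columns `band` and `event`; `pd.get_dummies` is an alternative when one column per event type is wanted.

```python
import pandas as pd

# Toy frame with hypothetical columns.
toy = pd.DataFrame({
    "band": ["Low", "Low", "High", "High", "High"],
    "event": ["Flood", None, None, "Drought", None],
})

# 1 when an event is recorded, 0 when the cell is missing.
toy["has_event"] = toy["event"].notna().astype(int)

# Rows: category, columns: indicator value, cells: row counts.
counts = pd.crosstab(toy["band"], toy["has_event"])
print(counts)
# counts.plot(kind="bar", stacked=True) would draw the stacked bars (requires matplotlib).
```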

In [None]:
# Question 10: Dummy Variable Creation and Stacked Bar Plot
# Create dummy variables for the presence of natural disasters (i.e., if the 'Natural Disaster Events' column is not None).
# Create a stacked bar plot to visualize the distribution of 'GDP_Category' based on the presence of natural disasters.
# Your Code Here:

**Explanation**

[Your explanation here]

## **Advanced (RBT Levels: 4, 5):**

Total: 15 Marks

Each question carries 3 marks.

**Question 11: Conditional Missing Value Imputation**

If 'GDP_Per_Capita' is NaN, impute 'Undernourishment_Percentage' with the global mean. Otherwise, impute with the mean of the existing values for the particular 'GDP_Category'. Explain your approach.
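
The general pattern of a group-level fill with a global fallback can be sketched as below. The toy columns `gdp`, `band`, and `rate` are hypothetical stand-ins, and this is one possible arrangement rather than the expected solution.

```python
import numpy as np
import pandas as pd

# Toy frame: row 4 has no gdp (and so no band); rows 4 and 5 have no rate.
toy = pd.DataFrame({
    "gdp": [1000.0, 1200.0, 50000.0, 52000.0, np.nan, 1100.0],
    "band": ["Low", "Low", "High", "High", np.nan, "Low"],
    "rate": [20.0, 24.0, 4.0, 6.0, np.nan, np.nan],
})

global_mean = toy["rate"].mean()                 # mean over the four known rates
band_means = toy.groupby("band")["rate"].mean()  # one mean per band, NaN rates skipped
band_fill = toy["band"].map(band_means)          # NaN where the band itself is missing

# Use the band mean where gdp is known, the global mean otherwise.
fallback = band_fill.where(toy["gdp"].notna(), global_mean)
toy["rate_filled"] = toy["rate"].fillna(fallback)
print(toy["rate_filled"].tolist())
```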

In [None]:
# Question 11: Conditional Missing Value Imputation
# If 'GDP_Per_Capita' is NaN, impute 'Undernourishment_Percentage' with the global mean.
# Otherwise, impute with the mean of the existing values for the particular 'GDP_Category'.
# Explain your approach.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 12: Custom Binning Function**

Write a custom function to create a 'FoodSecurityLevel' column based on the 'Undernourishment_Percentage'. Categorize percentages below 5 as 'High Security', percentages between 5 and 20 as 'Medium Security', and percentages above 20 as 'Low Security'. Apply this function to create the new column.
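
The thresholds above leave the values exactly 5 and 20 open to interpretation. The sketch below assumes both boundaries fall in the middle category and returns `None` for missing input; the function name `classify_rate` is hypothetical.

```python
import pandas as pd


def classify_rate(rate):
    """Map a percentage to a label; 5 and 20 count as 'Medium Security' (an assumption)."""
    if pd.isna(rate):
        return None
    if rate < 5:
        return "High Security"
    if rate <= 20:
        return "Medium Security"
    return "Low Security"


rates = pd.Series([2.0, 5.0, 12.5, 20.0, 31.0])  # hypothetical percentages
levels = rates.apply(classify_rate)
print(levels.tolist())
```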

In [None]:
# Question 12: Custom Binning Function
# Write a custom function to create a 'FoodSecurityLevel' column based on the 'Undernourishment_Percentage'.
# Categorize percentages below 5 as 'High Security', percentages between 5 and 20 as 'Medium Security',
# and percentages above 20 as 'Low Security'.
# Apply this function to create the new column.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 13: Grouped Transformations and Line Chart**

Calculate the average 'Undernourishment_Percentage' for each 'Year'. Then create a new column called 'UndernourishmentNormalized' that represents each country's 'Undernourishment_Percentage' as a z-score relative to the year's average. Create a line chart visualizing the average normalized Undernourishment across years.
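
A within-group z-score subtracts the group mean and divides by the group standard deviation. The sketch below uses hypothetical columns `year` and `rate`, and the sample standard deviation that pandas computes by default.

```python
import pandas as pd

# Toy frame with hypothetical columns.
toy = pd.DataFrame({
    "year": [2018, 2018, 2018, 2019, 2019, 2019],
    "rate": [4.0, 6.0, 8.0, 10.0, 20.0, 30.0],
})

grouped = toy.groupby("year")["rate"]
# transform() returns one value per original row, aligned with toy's index.
toy["rate_z"] = (toy["rate"] - grouped.transform("mean")) / grouped.transform("std")
print(toy)
```

By construction, z-scores computed within a group average to zero for that group, which is worth keeping in mind when interpreting a plot of per-year means.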


In [None]:
# Question 13: Grouped Transformations and Line Chart
# Calculate the average 'Undernourishment_Percentage' for each 'Year'.
# Then create a new column called 'UndernourishmentNormalized' that represents each country's 'Undernourishment_Percentage' as a z-score relative to the year's average.
# Create a line chart visualizing the average normalized Undernourishment across years.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 14: Data Sampling and Validation**

Randomly sample 20% of the dataset. Use this sample to calculate the mean 'Food Supply Quantity (kg/capita/year)' for each 'GDP_Category'. Compare these means to the means calculated using the entire dataset. Discuss any differences and their potential implications.
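
Comparing group means from a sample against the full data can be laid out side by side. The sketch below uses synthetic values, hypothetical columns `band` and `supply`, and a fixed `random_state` so the sample is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the toy data is repeatable

# Synthetic frame: two bands of 500 rows each with different centres.
toy = pd.DataFrame({
    "band": np.repeat(["Low", "High"], 500),
    "supply": np.concatenate([
        rng.normal(2000, 300, 500),
        rng.normal(3000, 300, 500),
    ]),
})

sample = toy.sample(frac=0.2, random_state=0)  # 20% of the rows

comparison = pd.DataFrame({
    "full": toy.groupby("band")["supply"].mean(),
    "sample": sample.groupby("band")["supply"].mean(),
})
comparison["difference"] = comparison["sample"] - comparison["full"]
print(comparison)
```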

In [None]:
# Question 14: Data Sampling and Validation
# Randomly sample 20% of the dataset. Use this sample to calculate the mean 'Food Supply Quantity (kg/capita/year)' for each 'GDP_Category'.
# Compare these means to the means calculated using the entire dataset.
# Discuss any differences and their potential implications.
# Your Code Here:

**Explanation**

[Your explanation here]

**Question 15: Merging Hypothetical Data**

Imagine you have a second dataset with information on international aid for food security (e.g., aid amounts per country per year). Merge this hypothetical dataset with the food security dataset using the 'Country' and 'Year' columns as keys. Explain your merge strategy and how this merged data could be used for further analysis.
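
A two-key merge can be sketched as follows. Both frames, the `aid_usd` column, and the choice of a left join are assumptions for illustration; `validate` and `indicator` are optional `merge` arguments that help check the result.

```python
import pandas as pd

# Toy stand-ins for the two datasets.
food = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Year": [2018, 2019, 2018],
    "rate": [10.0, 11.0, 5.0],
})
aid = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Year": [2018, 2018, 2018],
    "aid_usd": [1.5e6, 2.0e6, 9.0e5],  # hypothetical amounts
})

merged = food.merge(
    aid,
    on=["Country", "Year"],
    how="left",             # keep every food row, even without aid data
    validate="one_to_one",  # raise if a key pair repeats in either frame
    indicator=True,         # adds a _merge column showing each row's origin
)
print(merged)
```

Keys must match exactly, including case and surrounding whitespace, so any cleaning applied to 'Country' in one dataset needs to be applied to the other before merging.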

In [None]:
# Question 15: Merging Hypothetical Data
# Imagine you have a second dataset with information on international aid for food security (e.g., aid amounts per country per year).
# Merge this hypothetical dataset with the food security dataset using the 'Country' and 'Year' columns as keys.
# Explain your merge strategy and how this merged data could be used for further analysis.
# Your Code Here:

**Explanation**

[Your explanation here]

**Report**

**Part 1**

- In this section, compile the explanation of each of the questions.

**Part 2**

- Answer the following data analysis questions:
  1. What are the key characteristics of global food security based on this dataset?
  2. Which factors appear to have the strongest influence on food security?
  3. What are the most common missing data patterns, and what implications might they have?
  4. Based on your analysis, what are 2-3 recommendations you would make to improve global food security?