# Week 6 – Self-Guided Notebook (Holiday Week): Data Import + Visualisation Practice

🎯 Goal: You’ll explore a dataset with 150 food items and their nutritional values using the libraries Pandas, NumPy, Matplotlib, and Seaborn.

### 📁 STEP 1 – How to Upload Your .csv File in Google Colab

If you’re using Google Colab, you must upload the **nutrition_data_150.csv** file before running the code. The data is stored as comma-separated values (CSV format), which we will convert into a DataFrame using Pandas.

Run the next cell and use the dialog to upload the file "nutrition_data_150.csv".

In [None]:
from google.colab import files

# Run this cell and click on the button that will appear below.
# A dialog box will ask you to select the file "nutrition_data_150.csv"
uploaded = files.upload()

### STEP 2 – Load and Explore the Dataset

🔍 What’s inside the file?

Let’s import the libraries and load the .csv file.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the dataset (the file must be in the current working directory)
imported_df = pd.read_csv('nutrition_data_150.csv')

# Display the first 5 rows
print("First 5 rows of the imported DataFrame:")
print(imported_df.head())

This code reads the `nutrition_data_150.csv` file into a pandas DataFrame.

- `pd.read_csv(...)`: loads the table.
- `.head()`: shows the first 5 rows of the table to give you a preview. To display a different number of rows, pass it as an argument inside the parentheses. For example, `.head(15)` returns the first 15 rows.
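To experiment with `pd.read_csv` and `.head()` without the real file, here is a minimal sketch that reads a tiny CSV from a string. The column names match those used in this notebook, but the four rows and their values are invented for illustration and are not taken from `nutrition_data_150.csv`.

```python
import io
import pandas as pd

# Hypothetical miniature CSV; the values are invented for illustration only.
csv_text = """Food Item,Calories (kcal),Protein (g),Fat (g)
Apple,52,0.3,0.2
Avocado,160,2.0,14.7
Chicken,239,27.3,13.6
Rice,130,2.7,0.3
"""

# pd.read_csv accepts a file path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # data type that pandas inferred for each column
print(sample_df.head(2))  # only the first 2 rows
```

Checking `.shape` and `.dtypes` right after loading is a quick way to confirm that the file was parsed as expected.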

### STEP 3 – Descriptive Statistics

Run the next cell to print the basic statistics of the DataFrame.

In [None]:
print("Statistical Summary:")
print(imported_df.describe())

`.describe()` gives summary statistics for numeric columns:

- count = number of entries
- mean = average
- std = standard deviation
- min/max = smallest and largest values
- 25%, 50%, 75% = quartiles
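To see how these numbers relate to the data, the sketch below runs `.describe()` on five made-up calorie values (not from the dataset) that are small enough to check by hand. Note that pandas reports the sample standard deviation (dividing by n - 1).

```python
import pandas as pd

# Five invented values, chosen so the statistics are easy to verify by hand.
calories = pd.Series([100, 200, 300, 400, 500], name='Calories (kcal)')

summary = calories.describe()
print(summary)

# Individual statistics can be read from the summary by label.
print("Mean:", summary['mean'])
print("Median (50%):", summary['50%'])

# The same values are available through dedicated methods.
print("Mean again:", calories.mean())
print("Median again:", calories.median())
```

The 50% row is the median, so comparing it with the mean gives a first hint about whether the distribution is skewed.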

### STEP 4 – Plot a Histogram

Let’s see how calories are distributed across all food items.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(imported_df['Calories (kcal)'], bins=20, color='skyblue', edgecolor='black')
plt.title('Histogram of Calories')
plt.xlabel('Calories (kcal)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

- Each bar shows how many foods fall into a specific calorie range.
- This helps you understand the overall distribution (e.g., are most foods low or high calorie?).
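The bar heights in a histogram are simply counts per bin. As a rough illustration, `np.histogram` computes those counts without drawing anything. The eight values below are invented for this example.

```python
import numpy as np

# Invented calorie values, for illustration only.
calories = np.array([50, 60, 120, 130, 140, 260, 270, 390])

# Split the range from min to max into 4 equal-width bins and count the values.
counts, edges = np.histogram(calories, bins=4)

# Print each bin's range together with the number of values it contains.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f} kcal: {count} item(s)")
```

Changing `bins` changes the bin edges, which is why the same data can look quite different with `bins=5` and `bins=50`.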

In [None]:
# Filter for a specific food item, e.g., 'Avocado'
avocado_df = imported_df[imported_df['Food Item'] == 'Avocado']

plt.figure(figsize=(10, 6))
plt.hist(avocado_df['Calories (kcal)'], bins=10, color='salmon', edgecolor='black')
plt.title('Calories in Avocados')
plt.xlabel('Calories (kcal)')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# ✅ Focuses only on the calorie variation among avocado samples.
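The filter above relies on a boolean mask. The sketch below shows the two steps separately on a small made-up DataFrame (the rows are invented): the comparison produces one `True`/`False` value per row, and indexing with that mask keeps only the `True` rows.

```python
import pandas as pd

# Invented rows, for illustration only.
df = pd.DataFrame({
    'Food Item': ['Avocado', 'Apple', 'Avocado', 'Rice'],
    'Calories (kcal)': [160, 52, 170, 130],
})

# Step 1: the comparison returns a boolean Series, one value per row.
mask = df['Food Item'] == 'Avocado'
print(mask)

# Step 2: indexing with the mask keeps only the rows where it is True.
avocado_rows = df[mask]
print(avocado_rows)
print("Number of matching rows:", len(avocado_rows))
```

It is worth printing the number of matching rows before plotting: a histogram built from only a handful of rows will look sparse.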

### STEP 5 – Scatter Plot: Protein vs. Fat

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(imported_df['Protein (g)'], imported_df['Fat (g)'], color='green', alpha=0.5)
plt.title('Scatter Plot of Protein vs. Fat')
plt.xlabel('Protein (g)')
plt.ylabel('Fat (g)')
plt.grid(True)
plt.show()

- Each point represents a food item.
- You can look for relationships: do foods with more protein also have more fat?
- `alpha=0.5` makes overlapping dots more transparent. Try different values of alpha between 0.0 and 1.0.
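One way to put a number on the relationship you see in a scatter plot is the Pearson correlation coefficient, which pandas computes with `Series.corr`. This is an optional cross-check rather than part of the original exercise, and the values below are invented so that the relationship is almost perfectly linear.

```python
import pandas as pd

# Invented values with a nearly linear relationship, for illustration only.
df = pd.DataFrame({
    'Protein (g)': [1.0, 2.0, 3.0, 4.0, 5.0],
    'Fat (g)': [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Pearson correlation: close to +1 means a strong positive linear relationship,
# close to -1 a strong negative one, and close to 0 little linear relationship.
r = df['Protein (g)'].corr(df['Fat (g)'])
print(f"Correlation between protein and fat: {r:.3f}")
```

A value near 0 does not rule out a non-linear pattern, so the scatter plot is still worth looking at.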

In [None]:
plt.figure(figsize=(10, 6))
plt.scatter(imported_df['Protein (g)'], imported_df['Calories (kcal)'], color='green', alpha=0.5)
plt.title('Scatter Plot of Protein vs. Calories')
plt.xlabel('Protein (g)')
plt.ylabel('Calories (kcal)')
plt.grid(True)
plt.show()

### STEP 6 – Box Plot: Calories Across Food Types

In [None]:
plt.figure(figsize=(10, 6))
sns.boxplot(x='Food Item', y='Calories (kcal)', data=imported_df)
plt.title('Calories Distribution Across Food Items')
plt.xlabel('Food Item')
plt.ylabel('Calories (kcal)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

**What is a boxplot?**
A boxplot helps you see:

- The **median** (middle line)
- The **quartiles** (box edges)
- **Outliers** (dots outside whiskers)

It also helps you compare the calorie spread between food types.
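The elements of a boxplot can be computed by hand. The sketch below assumes the common 1.5 × IQR whisker rule (the default in Matplotlib and Seaborn) and uses seven invented values, one of which is deliberately extreme.

```python
import numpy as np

# Invented values; 400 is deliberately far from the rest.
values = np.array([100, 110, 120, 130, 140, 150, 400])

# Quartiles: the box edges (Q1, Q3) and the middle line (median).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, i.e. the height of the box

# Points beyond 1.5 * IQR from the box are drawn as individual outlier dots.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print("Q1:", q1, "Median:", median, "Q3:", q3)
print("Outliers:", outliers)
```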

Make sure the 'Food Item' column contains only a few unique values. If not, the x-axis might be unreadable.
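A quick way to check this is to count the unique values before plotting and, if there are too many, keep only the most frequent ones. The sketch below uses invented rows, and the cut-off of 2 categories is an arbitrary choice for illustration.

```python
import pandas as pd

# Invented rows, for illustration only.
df = pd.DataFrame({
    'Food Item': ['Apple', 'Apple', 'Rice', 'Rice', 'Rice', 'Avocado'],
    'Calories (kcal)': [50, 54, 128, 130, 133, 160],
})

# How many different categories would end up on the x-axis?
n_unique = df['Food Item'].nunique()
print("Unique food items:", n_unique)

# Keep only the 2 most frequent food items (arbitrary cut-off for this sketch).
top_items = df['Food Item'].value_counts().head(2).index
subset = df[df['Food Item'].isin(top_items)]
print(subset)
```

The resulting `subset` can be passed to `sns.boxplot` through the `data` argument in place of the full DataFrame.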

### Violin Plot of Calories Across Food Items
This shows both the distribution *shape* and spread for each food type.

In [None]:
plt.figure(figsize=(12, 6))
sns.violinplot(x='Food Item', y='Calories (kcal)', data=imported_df, inner='quartile')
plt.title('Calories Distribution by Food Item (Violin Plot)')
plt.xlabel('Food Item')
plt.ylabel('Calories (kcal)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Looking at the plots, can you tell which food has the most variance in its calorie content? And which one has a bimodal distribution of calorie content?
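After answering from the plots, you may want to cross-check the variance numerically. One possible approach, sketched here on invented rows, is to group by food item and compute the spread of each group. Note that this does not detect bimodality, which is still best judged from the violin plot.

```python
import pandas as pd

# Invented rows, for illustration only.
df = pd.DataFrame({
    'Food Item': ['Apple'] * 3 + ['Rice'] * 3,
    'Calories (kcal)': [50, 52, 54, 100, 130, 160],
})

# Mean, standard deviation, and variance of calories for each food item.
spread = df.groupby('Food Item')['Calories (kcal)'].agg(['mean', 'std', 'var'])
print(spread)

# Label of the group with the largest variance.
most_variable = spread['var'].idxmax()
print("Most variable food item:", most_variable)
```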

Try it Yourself!
1. Try changing `'Calories (kcal)'` to `'Protein (g)'` in the plots.
2. Filter the DataFrame to show only food items above 250 kcal.
3. Sort the foods by protein content using `imported_df.sort_values('Protein (g)')`.