<a href="https://colab.research.google.com/github/Elberthyindas/Data110_Fall_2025/blob/main/Week2_ElberthYindas_Assignment_Fall_2025.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 2 — In‑Class Python Work & Assignment
*Prepared: September 12, 2025*

This notebook combines everything covered **in class (Week 2)** and the **assignment** you will submit: run cells top‑to‑bottom, add your answers where prompted, and submit this file.

### Contents
1. [Part A — In‑Class Python Work](#part-a)
2. [Part B — Assignment: Anscombe’s Quartet](#part-b)


---

<a id='part-a'></a>

## Part A — In‑Class Python Work

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


The dataset used here is the World Happiness Report dataset, which is available at https://github.com/Reben80/Data110-21843--fall25/tree/main/dataset

In [None]:
# Load the dataset. First download the file from the link above and upload it to the Colab session.
df = pd.read_csv('/content/happiness_2017.csv')

# Alternatively, read the file directly from its raw GitHub URL:
# df = pd.read_csv('https://raw.githubusercontent.com/Reben80/Data110-21843--fall25/refs/heads/main/dataset/happiness.csv')


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# 1. View the first few rows
print(df.head())





In [None]:
# 2. Get summary statistics
print(df.describe())


In [None]:
# 3. Check the shape (rows, columns)
print(df.shape)

In [None]:

# 4. View column names
print(df.columns)


In [None]:

# 5. Get data types and info
print(df.info())



In [None]:



# 6. Count unique values in the 'Country' column (or any other relevant column)
print(df['Country'].value_counts())

### Example 1: Happiness Score vs. Log GDP per Capita

In [None]:

# Scatter plot of Happiness Score vs Log GDP per Capita
plt.scatter(df['Log GDP per capita'], df['HappinessScore'])
plt.xlabel('Log GDP per Capita')
plt.ylabel('Happiness Score')
plt.title('Happiness Score vs Log GDP per Capita')
plt.show()


### Example 2: Happiness Score vs. Healthy Life Expectancy

In [None]:
# Scatter plot of Happiness Score vs Healthy Life Expectancy at Birth
plt.scatter(df['Healthy life expectancy at birth'], df['HappinessScore'], color='green')
plt.xlabel('Healthy Life Expectancy at Birth')
plt.ylabel('Happiness Score')
plt.title('Happiness Score vs Healthy Life Expectancy at Birth')
plt.show()


### Tasks:
- **Happiness Score vs Social Support** (`Social support`)
- **Happiness Score vs Freedom to Make Life Choices** (`Freedom to make life choices`)
- **Happiness Score vs Generosity** (`Generosity`)

Create a scatter plot for each pair above, using the column names shown in backticks.
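
Because the exact column names depend on the file you loaded, it can help to look a column up by keyword instead of typing its name by hand. The sketch below is one way to do that; `find_column` and `sample_columns` are hypothetical names used only for illustration, and the list stands in for `df.columns`.

```python
def find_column(columns, keyword):
    """Return the first column whose name contains `keyword`, ignoring case."""
    matches = [c for c in columns if keyword.lower() in c.lower()]
    if not matches:
        raise KeyError(f"No column contains {keyword!r}. Available: {', '.join(columns)}")
    return matches[0]


# Hypothetical column list standing in for df.columns
sample_columns = ["Country", "HappinessScore", "Log GDP per capita",
                  "Social support", "Freedom to make life choices", "Generosity"]

social_col = find_column(sample_columns, "social")
freedom_col = find_column(sample_columns, "freedom")
print(social_col)
print(freedom_col)
```

Raising a `KeyError` that lists the available columns makes a wrong keyword easy to diagnose.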

### Extra Task:
Create **two scatter plots** using columns that do not have a meaningful relationship. For example, you might plot **Generosity** against **Healthy Life Expectancy**.
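
Each of these plots repeats the same steps: drop missing values, fit a line, and report the correlation. A minimal sketch of a reusable helper follows; `fit_line` is a hypothetical name, and the demo numbers are made up for illustration rather than taken from the happiness dataset.

```python
import numpy as np


def fit_line(x, y):
    """Drop pairs with missing values, then return (slope, intercept, r, n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = ~(np.isnan(x) | np.isnan(y))     # keep rows where both values exist
    x, y = x[keep], y[keep]
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 least-squares fit
    r = np.corrcoef(x, y)[0, 1]             # Pearson correlation
    return slope, intercept, r, len(x)


# Made-up numbers for illustration only (not from the happiness dataset)
x_demo = [1.0, 2.0, np.nan, 4.0, 5.0]
y_demo = [3.0, 5.0, 7.0, 9.0, np.nan]
slope, intercept, r, n = fit_line(x_demo, y_demo)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, r={r:.3f}, n={n}")
```

With the DataFrame loaded, the same call would look like `fit_line(df['Social support'], df['HappinessScore'])`.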

In [None]:
# Your code for Happiness Score vs Social Support (`Social support`) goes here

import matplotlib.pyplot as plt
import numpy as np


social_cols = [c for c in df.columns if 'social' in c.lower()]
if not social_cols:
    raise KeyError("Could not find a column containing 'social' in its name. Columns available: " + ", ".join(df.columns))

social_col = social_cols[0]   # use the first match
happiness_col_candidates = [c for c in df.columns if 'happi' in c.lower()]
if not happiness_col_candidates:
    raise KeyError("Could not find a happiness score column. Columns available: " + ", ".join(df.columns))
happiness_col = happiness_col_candidates[0]

mask = df[social_col].notna() & df[happiness_col].notna()
x = df.loc[mask, social_col].astype(float)
y = df.loc[mask, happiness_col].astype(float)

plt.figure(figsize=(7,5))
plt.scatter(x, y, color='tab:blue', edgecolor='k', alpha=0.8, s=60)
plt.xlabel(social_col)
plt.ylabel(happiness_col)
plt.title(f'{happiness_col} vs {social_col}')

m, b = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = m * x_line + b
plt.plot(x_line, y_line, color='red', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

corr = x.corr(y)
print(f"Using columns: social_col='{social_col}', happiness_col='{happiness_col}'")
print(f"Number of points used: {len(x)}")
print(f"Pearson correlation: {corr:.4f}")
print(f"Linear fit slope = {m:.4f}, intercept = {b:.4f}")

In [None]:
# Your code for Happiness Score vs Freedom to Make Life Choices (`Freedom to make life choices`) goes here

import matplotlib.pyplot as plt
import numpy as np


freedom_cols = [c for c in df.columns if 'freedom' in c.lower() or 'life choice' in c.lower()]
if not freedom_cols:
    # Fallback: any column that mentions both 'life' and 'choice'
    freedom_cols = [c for c in df.columns if 'life' in c.lower() and 'choice' in c.lower()]
if not freedom_cols:
    raise KeyError("Could not find a column for freedom to make life choices. Columns available: " + ", ".join(df.columns))

freedom_col = freedom_cols[0]


happiness_col_candidates = [c for c in df.columns if 'happi' in c.lower()]
if not happiness_col_candidates:
    raise KeyError("Could not find a happiness score column. Columns available: " + ", ".join(df.columns))
happiness_col = happiness_col_candidates[0]


mask = df[freedom_col].notna() & df[happiness_col].notna()
x = df.loc[mask, freedom_col].astype(float)
y = df.loc[mask, happiness_col].astype(float)


plt.figure(figsize=(7,5))
plt.scatter(x, y, color='tab:orange', edgecolor='k', alpha=0.8, s=60)
plt.xlabel(freedom_col)
plt.ylabel(happiness_col)
plt.title(f'{happiness_col} vs {freedom_col}')


m, b = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = m * x_line + b
plt.plot(x_line, y_line, color='blue', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = x.corr(y)
print(f"Using columns: freedom_col='{freedom_col}', happiness_col='{happiness_col}'")
print(f"Number of points used: {len(x)}")
print(f"Pearson correlation: {corr:.4f}")
print(f"Linear fit slope = {m:.4f}, intercept = {b:.4f}")

In [None]:
# Your code for Happiness Score vs Generosity (`Generosity`) goes here

import matplotlib.pyplot as plt
import numpy as np


generosity_cols = [c for c in df.columns if 'generos' in c.lower()]
if not generosity_cols:

    generosity_cols = [c for c in df.columns if c.lower() in ('generosity','generous','generousity')]
if not generosity_cols:
    raise KeyError("Could not find a column containing 'generos' in its name. Columns available: " + ", ".join(df.columns))

generosity_col = generosity_cols[0]


happiness_col_candidates = [c for c in df.columns if 'happi' in c.lower() or c.lower() in ('score', 'ladder score', 'ladder_score')]
if not happiness_col_candidates:
    raise KeyError("Could not find a happiness score column. Columns available: " + ", ".join(df.columns))
happiness_col = happiness_col_candidates[0]


mask = df[generosity_col].notna() & df[happiness_col].notna()
x = df.loc[mask, generosity_col].astype(float)
y = df.loc[mask, happiness_col].astype(float)


plt.figure(figsize=(7,5))
plt.scatter(x, y, color='tab:purple', edgecolor='k', alpha=0.85, s=60)
plt.xlabel(generosity_col)
plt.ylabel(happiness_col)
plt.title(f'{happiness_col} vs {generosity_col}')


m, b = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = m * x_line + b
plt.plot(x_line, y_line, color='black', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = x.corr(y)
print(f"Using columns: generosity_col='{generosity_col}', happiness_col='{happiness_col}'")
print(f"Number of points used: {len(x)}")
print(f"Pearson correlation: {corr:.4f}")
print(f"Linear fit slope = {m:.4f}, intercept = {b:.4f}")

In [None]:
# Extra Task 1 code

import matplotlib.pyplot as plt
import numpy as np


generosity_cols = [c for c in df.columns if 'generos' in c.lower()]
if not generosity_cols:
    generosity_cols = [c for c in df.columns if 'generous' in c.lower()]
if not generosity_cols:
    raise KeyError("Could not find a generosity-like column. Available columns: " + ", ".join(df.columns))
generosity_col = generosity_cols[0]


health_cols = [c for c in df.columns if 'healthy' in c.lower() and 'life' in c.lower()]
if not health_cols:
    health_cols = [c for c in df.columns if 'life' in c.lower() and 'expect' in c.lower()]
if not health_cols:
    health_cols = [c for c in df.columns if 'life' in c.lower() and 'birth' in c.lower()]
if not health_cols:
    raise KeyError("Could not find a healthy life expectancy column. Available columns: " + ", ".join(df.columns))
health_col = health_cols[0]


mask = df[generosity_col].notna() & df[health_col].notna()
x = df.loc[mask, generosity_col].astype(float)
y = df.loc[mask, health_col].astype(float)


plt.figure(figsize=(7,5))
plt.scatter(x, y, color='tab:green', edgecolor='k', alpha=0.8, s=70)
plt.xlabel(generosity_col)
plt.ylabel(health_col)
plt.title(f'{health_col} vs {generosity_col} (Extra Task 1)')


m, b = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = m * x_line + b
plt.plot(x_line, y_line, color='red', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = x.corr(y)
print(f"Using columns: generosity_col='{generosity_col}', health_col='{health_col}'")
print(f"Points used: {len(x)}")
print(f"Pearson correlation: {corr:.4f}")
print(f"Linear fit slope = {m:.4f}, intercept = {b:.4f}")

In [None]:
# Extra Task 2 code

import matplotlib.pyplot as plt
import numpy as np


social_cols = [c for c in df.columns if 'social' in c.lower()]
if not social_cols:

    social_cols = [c for c in df.columns if 'support' in c.lower()]
if not social_cols:
    raise KeyError("Could not find a 'social' or 'support' column. Available columns: " + ", ".join(df.columns))
social_col = social_cols[0]


generosity_cols = [c for c in df.columns if 'generos' in c.lower()]
if not generosity_cols:
    generosity_cols = [c for c in df.columns if 'generous' in c.lower()]
if not generosity_cols:
    raise KeyError("Could not find a 'generosity' column. Available columns: " + ", ".join(df.columns))
generosity_col = generosity_cols[0]


mask = df[social_col].notna() & df[generosity_col].notna()
x = df.loc[mask, social_col].astype(float)
y = df.loc[mask, generosity_col].astype(float)


plt.figure(figsize=(7,5))
plt.scatter(x, y, color='tab:red', edgecolor='k', alpha=0.8, s=70)
plt.xlabel(social_col)
plt.ylabel(generosity_col)
plt.title(f'{generosity_col} vs {social_col} (Extra Task 2)')


m, b = np.polyfit(x, y, 1)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = m * x_line + b
plt.plot(x_line, y_line, color='blue', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = x.corr(y)
print(f"Using columns: social_col='{social_col}', generosity_col='{generosity_col}'")
print(f"Points used: {len(x)}")
print(f"Pearson correlation: {corr:.4f}")
print(f"Linear fit slope = {m:.4f}, intercept = {b:.4f}")


---

<a id='part-b'></a>

## Part B — Assignment: Anscombe’s Quartet

### Step 1: Import Necessary Packages

Before we begin, we need to import the necessary Python libraries for plotting and performing the regression. We'll use:

- `matplotlib.pyplot` for creating the graph
- `numpy` for numerical operations


In [None]:
import matplotlib.pyplot as plt
import numpy as np

### Step 2: Anscombe's Quartet Dataset

This dataset is known as **Anscombe's Quartet**, created by statistician Francis Anscombe to illustrate the importance of visualizing data. Despite having nearly identical statistical properties (e.g., mean, variance, correlation, and linear regression), each dataset tells a very different story when graphed.

- **x**: The independent variable, common across three datasets.
- **y1, y2, y3**: Three different dependent variables associated with the same `x` values.
- **x4, y4**: A special case where most of the `x` values are identical, with one outlier.
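
The claim of nearly identical statistics can be checked directly. The sketch below computes the mean, sample variance, Pearson correlation, and fitted line for each of the four pairs; the `quartet` and `summary` dictionaries are names introduced here for illustration.

```python
import numpy as np

x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "x, y1": (x, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "x, y2": (x, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "x, y3": (x, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "x4, y4": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (xs, ys) in quartet.items():
    xs = np.array(xs, dtype=float)
    ys = np.array(ys, dtype=float)
    slope, intercept = np.polyfit(xs, ys, 1)   # least-squares line
    r = np.corrcoef(xs, ys)[0, 1]              # Pearson correlation
    # ddof=1 gives the sample variance
    summary[name] = (xs.mean(), ys.mean(), xs.var(ddof=1), ys.var(ddof=1), r, slope, intercept)
    mean_x, mean_y, var_x, var_y, r, slope, intercept = summary[name]
    print(f"{name:7s} mean_x={mean_x:.2f} mean_y={mean_y:.2f} "
          f"var_x={var_x:.2f} var_y={var_y:.2f} r={r:.3f} "
          f"fit: y={slope:.3f}x+{intercept:.3f}")
```

The four printed rows should agree to about two decimal places, which is exactly why the plots are needed to tell the datasets apart.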

#### Anscombe's Quartet:


In [None]:
# Anscombe's Quartet:
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


In [None]:
plt.scatter(x, y1)


In [None]:
# Calculate the slope (m) and intercept (b) of the line using np.polyfit
m, b = np.polyfit(x, y1, 1)

# Create the regression line
regression_line = m * np.array(x) + b

# Plot the data points and regression line
plt.scatter(x, y1)
plt.plot(x, regression_line, color='red')
plt.xlabel('x')
plt.ylabel('y1')


### Your Task:

Perform the same linear regression process for the remaining datasets:

- Fit and plot `y2` and `y3` against `x`, and `y4` against `x4`.
- Use a distinct color for each plot and label the axes appropriately (`y2`, `y3`, etc.).
- Discuss any differences you observe when comparing the results across all datasets.
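
One possible way to make the comparison concrete, beyond what the task requires, is to look at residuals (observed minus fitted values). The sketch below does this for `y2` only and compares a straight line with a quadratic; the choice of a quadratic is an assumption made for illustration, not part of the assignment.

```python
import numpy as np

x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

# Straight-line fit and its residuals
m, b = np.polyfit(x, y2, 1)
linear_resid = y2 - (m * x + b)

# Quadratic fit for comparison (illustrative choice)
coeffs = np.polyfit(x, y2, 2)
quad_resid = y2 - np.polyval(coeffs, x)

print(f"largest |residual|, line:      {np.abs(linear_resid).max():.3f}")
print(f"largest |residual|, quadratic: {np.abs(quad_resid).max():.3f}")
```

A large gap between the two numbers suggests that a straight line is the wrong shape for that dataset, even though its correlation matches the others.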


In [None]:
# Your code for x and y2 goes here

import matplotlib.pyplot as plt
import numpy as np


x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])


m, b = np.polyfit(x, y2, 1)


x_line = np.linspace(x.min() - 1, x.max() + 1, 200)
y_line = m * x_line + b


plt.figure(figsize=(6,4))
plt.scatter(x, y2, color='tab:orange', edgecolor='k', s=70, alpha=0.9)
plt.plot(x_line, y_line, color='black', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.xlabel('x')
plt.ylabel('y2')
plt.title("Anscombe's Quartet — Dataset y2")
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = np.corrcoef(x, y2)[0,1]
print(f"Points: {len(x)}")
print(f"Slope = {m:.4f}, Intercept = {b:.4f}")
print(f"Pearson correlation (x, y2) = {corr:.4f}")

In [None]:
# Your code for x and y3 goes here

import matplotlib.pyplot as plt
import numpy as np


x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y3 = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


m, b = np.polyfit(x, y3, 1)


x_line = np.linspace(x.min() - 1, x.max() + 1, 200)
y_line = m * x_line + b


plt.figure(figsize=(6,4))
plt.scatter(x, y3, color='tab:green', edgecolor='k', s=70, alpha=0.9)
plt.plot(x_line, y_line, color='black', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.xlabel('x')
plt.ylabel('y3')
plt.title("Anscombe's Quartet — Dataset y3")
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = np.corrcoef(x, y3)[0,1]
print(f"Points: {len(x)}")
print(f"Slope = {m:.4f}, Intercept = {b:.4f}")
print(f"Pearson correlation (x, y3) = {corr:.4f}")

In [None]:
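# Your code for x and y4 goes here
# Note: in Anscombe's Quartet y4 is paired with x4, so this plot against the shared x is for comparison only.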

import matplotlib.pyplot as plt
import numpy as np


x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])


m, b = np.polyfit(x, y4, 1)


x_line = np.linspace(x.min() - 1, x.max() + 1, 200)
y_line = m * x_line + b


plt.figure(figsize=(6,4))
plt.scatter(x, y4, color='tab:red', edgecolor='k', s=70, alpha=0.9)
plt.plot(x_line, y_line, color='black', linewidth=2, label=f'fit: y={m:.3f}x+{b:.3f}')
plt.xlabel('x')
plt.ylabel('y4')
plt.title("Anscombe's Quartet — Dataset y4 (with common x)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()


corr = np.corrcoef(x, y4)[0,1]
print(f"Points: {len(x)}")
print(f"Slope = {m:.4f}, Intercept = {b:.4f}")
print(f"Pearson correlation (x, y4) = {corr:.4f}")


In [None]:
# Your code for x4 and y4 goes here

import matplotlib.pyplot as plt
import numpy as np


x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])


m_full, b_full = np.polyfit(x4, y4, 1)
x_line = np.linspace(x4.min() - 1, x4.max() + 1, 200)
y_line_full = m_full * x_line + b_full


# Remove the high-leverage point (x = 19). Every remaining x equals 8, so a slope
# and a Pearson r cannot be estimated from those points: np.polyfit is rank-deficient
# there and np.corrcoef returns nan. Summarize them by their mean y instead.
mask_no_out = x4 != 19
x_no_out = x4[mask_no_out]
y_no_out = y4[mask_no_out]
y_mean_no_out = y_no_out.mean()


corr_full = np.corrcoef(x4, y4)[0, 1]


plt.figure(figsize=(7,5))
plt.scatter(x4, y4, color='tab:red', edgecolor='k', s=80, label='data points')

outlier_idx = np.where(x4 == 19)[0]
plt.scatter(x4[outlier_idx], y4[outlier_idx], color='gold', edgecolor='k', s=160, marker='*', label='leverage outlier (x=19)')
plt.plot(x_line, y_line_full, color='black', linewidth=2, label=f'fit (with outlier): y={m_full:.3f}x+{b_full:.3f}')
plt.axhline(y_mean_no_out, color='blue', linestyle='--', linewidth=2, label=f'mean y without outlier: {y_mean_no_out:.3f}')
plt.xlabel('x4')
plt.ylabel('y4')
plt.title("Anscombe's Quartet: x4 vs y4 (effect of leverage outlier)")
plt.legend()
plt.grid(alpha=0.3)
plt.show()


print(f"Full data: slope = {m_full:.4f}, intercept = {b_full:.4f}, Pearson r = {corr_full:.4f}, n = {len(x4)}")
print(f"Without outlier: x4 is constant at {x_no_out[0]}, so slope and Pearson r are undefined; mean y = {y_mean_no_out:.4f}, n = {len(x_no_out)}")


### Reflection Question:

After visualizing the linear regression for all four datasets in Anscombe's Quartet, reflect on the following:

#### What is your reflection?
- How do the datasets visually differ despite having similar summary statistics?
- How did the outlier in the `x4, y4` dataset affect the regression line compared to the other datasets?
- Why is it important to visualize data in addition to calculating summary statistics?

Please provide your insights and discuss the importance of data visualization in understanding relationships between variables.


Although the four datasets share almost the same summary numbers (means, variances, correlations, and regression lines), their plots look very different. `y1` scatters around a straight line, `y2` follows a smooth curve, `y3` lies on a line except for one point far above the rest, and in `x4, y4` every x-value is 8 except a single point at x = 19. That one extreme point fixes the regression line by itself: without it all x-values are identical and no slope can be estimated, so the fitted line says nothing about the other ten points. This is why plotting matters: graphs reveal shapes and outliers that summary statistics hide, so look at the data before trusting the numbers or choosing a model.
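
The pull of the single extreme point in `x4, y4` can be quantified with leverage, a standard diagnostic for simple linear regression. The sketch below is an illustration added to support the reflection, not part of the original assignment; the formula used is $h_i = 1/n + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$.

```python
import numpy as np

x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
y4 = np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# Leverage of each point: h_i = 1/n + (x_i - mean(x))^2 / sum((x - mean(x))^2)
n = len(x4)
dev = x4 - x4.mean()
leverage = 1 / n + dev**2 / np.sum(dev**2)

# Fit the line on all points and compute residuals (observed minus fitted)
m, b = np.polyfit(x4, y4, 1)
residuals = y4 - (m * x4 + b)

outlier = np.argmax(x4)  # index of the point at x = 19
print(f"leverage at x=19:   {leverage[outlier]:.3f}")
print(f"leverage elsewhere: {leverage[0]:.3f}")
print(f"residual at x=19:   {residuals[outlier]:.3f}")
```

A leverage close to 1 means the fitted line is forced through that point, which is why its residual is essentially zero.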

---

**Submission Reminder**

- Before submitting, restart the kernel and choose **Run All** to ensure a clean, reproducible output.
- Save as `Week2_yourname_Assignment.ipynb`.
- Upload to the designated location in MS Teams / GitHub as instructed.