# Final Project: 7.1 

This is the final project for Statistical Data Science. 

- Understanding fundamental data science concepts, data acquisition, cleaning, exploration, and visualization. 
- Applying Python programming and its libraries for data manipulation and analysis. 
- Utilizing basic statistical methods for analyzing and interpreting data effectively. 
- Demonstrating proficiency in data preprocessing techniques, including handling missing values and outliers appropriately. 
- Developing skills in data visualization to communicate insights effectively from data exploration. 
- Applying machine learning techniques for predictive modeling and interpreting patterns to communicate findings to diverse audiences. 

**Student Name:** OLIVIA ANDERSON
<br>
**Student Name:** ROY WILLIAMS


## Load Dataset

Load the dataset into a pandas DataFrame, then show its first 10 rows (the head) with the `display` function.

In [1]:
# Import Required Libraries
import os

import numpy as np
import pandas as pd
from IPython.display import display, HTML

# Show every column when a DataFrame is displayed
pd.set_option('display.max_columns', None)


In [2]:
data_path = os.path.join('..', '..', 'data', '01_raw', 'Worldwide Vaccine Data.csv')
df = pd.read_csv(data_path)

# Give the existing index a name
df.index.name = 'Index'

html = df.head(10).to_html(max_cols=None)
display(HTML(f'<div style="overflow-x:auto">{html}</div>'))

| Index | Country | Doses administered per 100 people | Total doses administered | % of population vaccinated | % of population fully vaccinated |
|---|---|---|---|---|---|
| 0 | Afghanistan | 33 | 12526397 | 30.0 | 28.0 |
| 1 | Albania | 106 | 3025728 | 47.0 | 44.0 |
| 2 | Algeria | 35 | 15267442 | 18.0 | 15.0 |
| 3 | Angola | 74 | 23701049 | 47.0 | 26.0 |
| 4 | Argentina | 252 | 113272665 | 92.0 | 84.0 |
| 5 | Armenia | 73 | 2150112 | 38.0 | 33.0 |
| 6 | Aruba | 164 | 174215 | 85.0 | 79.0 |
| 7 | Australia | 251 | 63634307 | 88.0 | 85.0 |
| 8 | Austria | 228 | 20263306 | 78.0 | 77.0 |
| 9 | Azerbaijan | 138 | 13857111 | 54.0 | 49.0 |


## Understand the Data

Examine the structure, data types, and summary statistics of the dataset.

In [None]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px

# Reuses `df`, the DataFrame loaded in the Load Dataset cell above

# Display DataFrame shape
shape_df = pd.DataFrame({'Rows': [df.shape[0]], 'Columns': [df.shape[1]]})
fig_shape = go.Figure(data=[go.Table(
    header=dict(values=list(shape_df.columns), fill_color='paleturquoise', align='left'),
    cells=dict(values=[shape_df.Rows, shape_df.Columns], fill_color='lavender', align='left'))
])
fig_shape.update_layout(title_text='DataFrame Shape')
fig_shape.show()

# Styled summary statistics
describe_df = df.describe().T.reset_index().rename(columns={'index': 'Feature'})
fig_desc = go.Figure(data=[go.Table(
    header=dict(values=list(describe_df.columns), fill_color='lightblue', align='left'),
    cells=dict(values=[describe_df[col] for col in describe_df.columns], fill_color='white', align='left'))
])
fig_desc.update_layout(title_text='Summary Statistics')
fig_desc.show()

# Display columns and their data types
dtypes_df = pd.DataFrame({'Column': df.columns, 'Data Type': df.dtypes.astype(str).values})
fig_cols = go.Figure(data=[go.Table(
    header=dict(values=list(dtypes_df.columns), fill_color='lightgrey', align='left'),
    cells=dict(values=[dtypes_df['Column'], dtypes_df['Data Type']], fill_color='white', align='left'))
])
fig_cols.update_layout(title_text='Columns and Their Data Types')
fig_cols.show()

# Visualize data types count
dtype_counts = df.dtypes.value_counts()
dtype_names = dtype_counts.index.astype(str)
fig_dtype = px.bar(x=dtype_names, y=dtype_counts.values, labels={'x': 'Data Type', 'y': 'Count'},
                  title='Data Types Count', color=dtype_names, color_discrete_sequence=px.colors.qualitative.Pastel)
fig_dtype.update_layout(showlegend=False)
fig_dtype.show()

# Detailed info summary (like df.info + more)
info_df = pd.DataFrame({
    'Column': df.columns,
    'Non-Null Count': df.notnull().sum().values,
    'Missing Values': df.isnull().sum().values,
    'Unique Values': df.nunique().values,
    'Data Type': df.dtypes.astype(str).values
})
fig_info = go.Figure(data=[go.Table(
    header=dict(values=list(info_df.columns), fill_color='lightgreen', align='left'),
    cells=dict(values=[info_df[col] for col in info_df.columns], fill_color='white', align='left'))
])
fig_info.update_layout(title_text='DataFrame Info Summary')
fig_info.show()

#### **Classification Problem**
We would like to know which group (“well-vaccinated” or “poorly-vaccinated”) a country belongs to based on its features. We will use supervised learning to predict the group of a country from those features.
<br>

The first thing we will do is label each country as “well-vaccinated” or “poorly-vaccinated” based on a threshold for % of population fully vaccinated. This is our label for the classification problem.
<br>

We will use the remaining numeric columns as features to predict the label. The country name is excluded, since it only identifies the row.


First, let’s look at the distribution of “% of population fully vaccinated” to pick a good threshold for the labels.

In [4]:
import plotly.express as px

fig = px.histogram(
    df,
    x='% of population fully vaccinated',
    nbins=20,
    title='Distribution of Full Vaccination Rates by Country',
    labels={'% of population fully vaccinated': '% of Population Fully Vaccinated', 'count': 'Number of Countries'},
    color_discrete_sequence=['#636EFA']
)
fig.update_layout(
    xaxis_title='% of Population Fully Vaccinated',
    yaxis_title='Number of Countries',
    bargap=0.1
)
fig.show()

A common approach is to use the median as the dividing line between labels. In our case, it separates “well-vaccinated” from “poorly-vaccinated.” This is why Lesson 1.1 is so important: review it again to refresh your understanding of the median, mean, and mode.
<br>

- Countries above the median = “well-vaccinated”
- Countries at or below the median = “poorly-vaccinated”
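
The labeling rule above can be sketched in plain Python. This is a minimal illustration, not the notebook's own cell: the country names and rates are made up, and the strict `>` comparison is chosen to mirror the comparison used in the training cell, so a country sitting exactly on the median falls into the “poorly-vaccinated” group.

```python
from statistics import median

# Hypothetical full-vaccination rates (%), not taken from the dataset
rates = {"A": 28.0, "B": 44.0, "C": 62.0, "D": 62.0, "E": 85.0}

# The median is the middle value of the sorted rates
threshold = median(rates.values())

# Strict ">" means a country exactly at the median is labeled 0
labels = {country: int(rate > threshold) for country, rate in rates.items()}

print(threshold)
print(labels)
```

Because ties at the median go to class 0, the two groups are not guaranteed to be exactly the same size.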

#### **Training the Classification Model**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Calculate the median
median_thres = df['% of population fully vaccinated'].median()

# Create the label
df['well_vaccinated'] = (df['% of population fully vaccinated'] > median_thres).astype(int)

# Prepare features (excluding country and target)
features = [
    'Doses administered per 100 people',
    'Total doses administered',
    '% of population vaccinated'
]
X = df[features]
y = df['well_vaccinated']

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Train a logistic regression classifier
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
report = classification_report(y_test, y_pred, output_dict=True)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)


#### **Performance Evaluation**
We will use the classification report, confusion matrix, and accuracy score to evaluate the performance of our model.


#### **Report:**

In [8]:
import plotly.graph_objects as go
import numpy as np

metrics = report

labels = ['Class 0', 'Class 1', 'Macro Avg', 'Weighted Avg']
precisions = [
    metrics['0']['precision'],
    metrics['1']['precision'],
    metrics['macro avg']['precision'],
    metrics['weighted avg']['precision']
]
recalls = [
    metrics['0']['recall'],
    metrics['1']['recall'],
    metrics['macro avg']['recall'],
    metrics['weighted avg']['recall']
]
f1_scores = [
    metrics['0']['f1-score'],
    metrics['1']['f1-score'],
    metrics['macro avg']['f1-score'],
    metrics['weighted avg']['f1-score']
]

x = np.arange(len(labels))

fig = go.Figure()
fig.add_trace(go.Bar(x=labels, y=precisions, name='Precision'))
fig.add_trace(go.Bar(x=labels, y=recalls, name='Recall'))
fig.add_trace(go.Bar(x=labels, y=f1_scores, name='F1-Score'))

fig.update_layout(
    barmode='group',
    title='Classification Metrics by Class',
    yaxis=dict(title='Score', range=[0, 1.1]),
    xaxis=dict(title='Class'),
    legend=dict(title='Metric'),
    bargap=0.15
)
fig.show()


**Interpretation**
- Precision (Blue):
  Out of all countries predicted to be in a class, how many were actually correct?

    - High for Class 0 (1.0) means every time the model predicted "Class 0," it was right.

    - Lower for Class 1 (0.74) means some predictions for "Class 1" were actually wrong.

- Recall (Red):
  Out of all countries that truly belong to a class, how many did the model find?

    - High for Class 1 (1.0) means the model caught every "Class 1" country.

    - Lower for Class 0 (0.74) means it missed some "Class 0" countries.

- F1-Score (Green):
  Harmonic mean of precision and recall. High values mean the model balances both well.
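
The three scores described above can be reproduced from raw counts. The sketch below is a plain-Python illustration under stated assumptions: the helper name `precision_recall_f1` is hypothetical, and the counts are chosen so that the rounded results match the 1.0 and 0.74 quoted above rather than being read from the notebook's output.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for one class from its counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Assumed counts: tp = correct predictions of the class,
# fp = other class predicted as this class, fn = this class missed
class0 = precision_recall_f1(tp=20, fp=0, fn=7)
class1 = precision_recall_f1(tp=20, fp=7, fn=0)

for name, scores in (("Class 0", class0), ("Class 1", class1)):
    print(name, [round(s, 2) for s in scores])
```

Note how a false positive for one class is a false negative for the other, which is why the two classes show mirrored precision and recall.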

**Macro vs Weighted Averages**
- Macro Avg:
The simple average across both classes (treats each class equally, no matter how many countries are in each).

- Weighted Avg:
Takes into account how many countries are in each class—gives a sense of overall performance, especially if classes are unbalanced.
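
The difference between the two averages is easiest to see on imbalanced classes. The following sketch uses hypothetical recall values and class sizes (support), with illustrative helper names; it is not derived from the vaccination data.

```python
def macro_average(scores):
    """Plain mean: every class counts the same."""
    return sum(scores) / len(scores)

def weighted_average(scores, supports):
    """Mean weighted by the number of samples in each class."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# Hypothetical per-class recall and deliberately imbalanced class sizes
recalls = [0.50, 0.90]
supports = [10, 90]

macro = macro_average(recalls)
weighted = weighted_average(recalls, supports)

print(round(macro, 2), round(weighted, 2))
```

Here the weighted average is pulled towards the larger class, while the macro average exposes the weak score on the smaller one. When the classes are the same size, the two averages coincide.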

**Insights from Our Chart**
- Class 0: The model has high precision but lower recall, so it is right whenever it predicts this class but misses some actual Class 0 countries.

- Class 1: Model finds all actual Class 1s (high recall), but sometimes makes mistakes in predicting it (lower precision).

- Both F1-scores are balanced and high, so the model is performing well overall.

- Averages (macro/weighted) show balanced and strong model performance for both classes.

#### **Confusion Matrix:**


In [9]:
import plotly.figure_factory as ff
import numpy as np

z = conf_matrix
z_text = [[str(y) for y in x] for x in z]

fig = ff.create_annotated_heatmap(
    z,
    x=['Predicted 0', 'Predicted 1'],
    y=['Actual 0', 'Actual 1'],
    annotation_text=z_text,
    colorscale='Blues',
    showscale=True
)
fig.update_layout(title_text='Confusion Matrix', xaxis_title='Predicted Label', yaxis_title='True Label')
fig.show()

**Confusion Matrix Interpretation**

The four cell values are 20, 7, 0, and 20. The assignment below is the one consistent with the precision, recall, and accuracy reported in this notebook. Read each cell by its axis labels rather than by its position, because `create_annotated_heatmap` draws the first row of `z` at the bottom of the figure by default.

- **Actual 1, Predicted 1:**
  - 20 times the model correctly predicted Class 1 (True Positives for Class 1)
- **Actual 0, Predicted 0:**
  - 20 times the model correctly predicted Class 0 (True Positives for Class 0)
- **Actual 0, Predicted 1:**
  - 7 times the model incorrectly predicted Class 1 when it was actually Class 0 (False Positives for Class 1, False Negatives for Class 0)
- **Actual 1, Predicted 0:**
  - 0 times the model incorrectly predicted Class 0 when it was actually Class 1 (False Positives for Class 0, False Negatives for Class 1)

---

### What Does This Mean?
- Every Class 0 prediction was correct (20 right, 0 wrong), which matches the Class 0 precision of 1.0.
- All 20 "Actual 1" countries were predicted as Class 1, which matches the Class 1 recall of 1.0.
- 7 of the 27 "Actual 0" countries were wrongly classified as Class 1, which matches the Class 0 recall of 0.74.
- In total, 40 of the 47 test countries were classified correctly, which matches the 85.1% accuracy.

**Conclusion:**
- The model leans towards predicting Class 1: every one of its 7 errors is a Class 0 country labeled as Class 1.
- It never makes the opposite mistake, so it separates the two classes reasonably well but draws the boundary too generously for Class 1.
- This pattern can mean the features do not fully separate countries near the median. Because the labels come from a median split, the classes should be roughly balanced, so class imbalance is a less likely cause.
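
To make the cell layout concrete, the sketch below tallies a confusion matrix by hand. The label lists are hypothetical and much shorter than the real test set. Rows are actual classes and columns are predicted classes, which is the orientation `sklearn.metrics.confusion_matrix` uses.

```python
from collections import Counter

# Hypothetical labels (0 = poorly-vaccinated, 1 = well-vaccinated)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 1, 1, 1]

# Count how often each (actual, predicted) pair occurs
pair_counts = Counter(zip(y_true, y_pred))

# matrix[actual][predicted]; pairs that never occur count as 0
matrix = [[pair_counts[(actual, predicted)] for predicted in (0, 1)]
          for actual in (0, 1)]

for actual, row in enumerate(matrix):
    print(f"Actual {actual}: {row}")
```

Indexing the matrix as `matrix[actual][predicted]` avoids relying on where a plotting library happens to place each row.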

**Fixes or improvements**
- Try different algorithms (like Decision Tree, Random Forest)
- Add more or better features
- Handle class imbalance (if present)
- Perform hyperparameter tuning

#### **Accuracy Score:**


In [11]:
import plotly.graph_objects as go

fig = go.Figure(go.Indicator(
    mode = "gauge+number",
    value = accuracy * 100,
    number = {'suffix': '%'},
    title = {'text': "Model Accuracy"},
    gauge = {
        'axis': {'range': [0, 100]},
        'bar': {'color': "#636EFA"},
        'steps': [
            {'range': [0, 50], 'color': '#FFDDDD'},
            {'range': [50, 80], 'color': '#FFEFC6'},
            {'range': [80, 100], 'color': '#D4F7DC'}
        ],
        'threshold': {
            'line': {'color': "red", 'width': 4},
            'thickness': 0.75,
            'value': accuracy * 100
        }
    }
))
fig.update_layout(height=300)
fig.show()


**Accuracy**

- Accuracy is the proportion of correct predictions out of all predictions made by our model.
- Out of all country cases in our test data, the model made the right classification 85.1% of the time.
- 85.1% accuracy is generally good, but we need to check the **class balance**. If the classes are imbalanced, accuracy can be misleading: a model that always predicts the majority class can also reach a high accuracy.
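
The warning about imbalanced classes can be checked by comparing accuracy against a majority-class baseline. This is a minimal sketch with hypothetical labels and illustrative function names; it does not use the notebook's test set.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most common class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical imbalanced test set: 9 of 10 countries are Class 1
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
y_pred = [1] * 10  # a "model" that ignores its inputs entirely

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline(y_true)
print(model_acc, baseline_acc)
```

An accuracy that is no higher than the baseline means the model has learned nothing beyond the class frequencies, however good the raw number looks.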

#### **Conclusion:**
The classification report shows the precision, recall, and F1-score for each class. The confusion matrix shows the number of true positives, true negatives, false positives, and false negatives. The accuracy score shows the percentage of correct predictions.
<br>

Our classification model correctly predicts the vaccination class for countries 85.1% of the time, as shown by the gauge. Its weaker side is Class 0: precision for Class 0 is 1.0 but recall is only 0.74, so some poorly-vaccinated countries are labeled as Class 1, while every actual Class 1 country is found (recall of 1.0). The model therefore leans towards Class 1. To improve, we can try different algorithms, add better features, or handle class imbalance if it is present.
