# **Activity 2: Python-Pandas Exercise**

Objectives:
- Understand Python syntax (variables, loops, functions).
- Learn Pandas basics (Series, DataFrames, reading files).
- Perform data cleaning (handling missing values, correcting formats, removing duplicates).
- Apply concepts in a real-world case study.

# Part 1: Hands-on Python & Pandas Basics

1. Install the Pandas library in your environment.

pip install pandas

2. Import the pandas package under the name `pd`.

In [2]:
import pandas as pd

3. Print the pandas version

In [3]:
print(pd.__version__)

2.2.3


4. Create a variable `x` with the value 10 and a string variable `y` with "Fortes in Fide!"

In [5]:
x = 10
y = "Fortes in Fide!"

print(x,y)

10 Fortes in Fide!


5. Define a list with numbers `[1, 2, 3, 4, 5]` and a dictionary with keys `name` and `age`

In [6]:
numbers = [1, 2, 3, 4, 5]

person = {"name" : "jemriz", "age" : 18}

myPerson = pd.Series(person)
myNumbers = pd.Series(numbers)

print(myPerson, myNumbers)

name    jemriz
age         18
dtype: object 0    1
1    2
2    3
3    4
4    5
dtype: int64


6. Write a function `greet(name)` that returns "Magis, {name}!", where `{name}` is the value passed in.

In [1]:
def greet(name):
    return f"Magis, {name}!"

print(greet("Shan"))

Magis, Shan!


7. Write a Python function that takes a user’s name as input and prints a personalized greeting.

In [4]:
def greet_User():
    user_name = input("Enter your name: ")
    print(f"Hello, {user_name}!")

greet_User()


Hello, shan!


8. Modify the function from **Number 7** so that it defaults to "Guest" if the user does not enter a name.

In [3]:
def greet_User():
    name = input("Enter your name: ").strip()  # Remove leading/trailing spaces
    if not name:  # If the input is empty, assign "Guest"
        name = "Guest"
    print(f"Hello, {name}!")

# Call the function to execute
greet_User()


Hello, Guest!
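
Because `input()` needs a person at the keyboard, the default-to-"Guest" logic above is hard to check automatically. One possible variant, sketched here with a hypothetical helper name `build_greeting` that is not part of the original exercise, separates the string logic from the prompt so it can be called directly:

```python
def build_greeting(raw_name: str) -> str:
    """Return a greeting, falling back to "Guest" for blank input.

    `build_greeting` is a hypothetical helper used only for illustration.
    """
    name = raw_name.strip()  # remove leading/trailing whitespace
    # An empty string is falsy, so `or` substitutes the default
    return f"Hello, {name or 'Guest'}!"


print(build_greeting("shan"))
print(build_greeting("   "))
```

The interactive version can then simply call `print(build_greeting(input("Enter your name: ")))`.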


9. Create a Pandas Series from `[10, 20, 30, 40]`.

In [6]:
import pandas as pd

a = [10, 20, 30, 40]

myvar = pd.Series(a)

print(myvar)

0    10
1    20
2    30
3    40
dtype: int64


10. Create a DataFrame with columns `A` and `B`.

In [7]:
data = {
    "A":[40,50,60],
    "B": [10,20,30]
}

df = pd.DataFrame(data)

print(df)

    A   B
0  40  10
1  50  20
2  60  30


# Part 2: Working with a Dataset 🛥️

1. Load the Titanic dataset from a local file and display the first five rows.

In [11]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)


print(df.head())


   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  3101298  12.2875   NaN        S  


2. Display the dataset's column names and data types.

In [13]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)

print("Column names:")
print(df.columns)

print("\nData Types:")
print(df.dtypes)


Column names:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Data Types:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


3. Display the dataset's missing values.

In [14]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)


missing_values = df.isnull().sum()


print("Missing Values in Each Column:")
print(missing_values[missing_values > 0])

Missing Values in Each Column:
Age       86
Fare       1
Cabin    327
dtype: int64
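
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a small made-up DataFrame (the values are illustrative and are not taken from the Titanic file), showing how the count and the percentage of missing values per column could be reported together:

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Age": [34.5, np.nan, 62.0, np.nan],
    "Fare": [7.83, 7.00, np.nan, 8.66],
    "Pclass": [3, 3, 2, 3],
})

missing_count = toy.isnull().sum()
# The mean of a boolean column is the fraction of True values
missing_pct = toy.isnull().mean() * 100

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
# Keep only the columns that actually have missing values
print(summary[summary["missing"] > 0])
```

The same two lines should work unchanged on the `df` loaded from the CSV file.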


4. Display the first 10 rows of the `Name`, `Age`, and `Fare` columns from the dataset.

In [21]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)

columns_to_display = ["Name", "Age", "Fare"]
df_selected = df[columns_to_display].head(10)  # first 10 rows, as the task asks

print(columns_to_display)
print(df_selected)

['Name', 'Age', 'Fare']
                                           Name   Age     Fare
0                              Kelly, Mr. James  34.5   7.8292
1              Wilkes, Mrs. James (Ellen Needs)  47.0   7.0000
2                     Myles, Mr. Thomas Francis  62.0   9.6875
3                              Wirz, Mr. Albert  27.0   8.6625
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  22.0  12.2875
5                    Svensson, Mr. Johan Cervin  14.0   9.2250
6                          Connolly, Miss. Kate  30.0   7.6292
7                  Caldwell, Mr. Albert Francis  26.0  29.0000
8     Abrahim, Mrs. Joseph (Sophie Halaut Easu)  18.0   7.2292
9                       Davies, Mr. John Samuel  21.0  24.1500


5. Print the descriptive statistics of the Titanic dataset.

In [23]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)

df_statistics = df.describe()

print(df_statistics)

       PassengerId    Survived      Pclass         Age       SibSp  \
count   418.000000  418.000000  418.000000  332.000000  418.000000   
mean   1100.500000    0.363636    2.265550   30.272590    0.447368   
std     120.810458    0.481622    0.841838   14.181209    0.896760   
min     892.000000    0.000000    1.000000    0.170000    0.000000   
25%     996.250000    0.000000    1.000000   21.000000    0.000000   
50%    1100.500000    0.000000    3.000000   27.000000    0.000000   
75%    1204.750000    1.000000    3.000000   39.000000    1.000000   
max    1309.000000    1.000000    3.000000   76.000000    8.000000   

            Parch        Fare  
count  418.000000  417.000000  
mean     0.392344   35.627188  
std      0.981429   55.907576  
min      0.000000    0.000000  
25%      0.000000    7.895800  
50%      0.000000   14.454200  
75%      0.000000   31.500000  
max      9.000000  512.329200  


6. Remove rows with missing values in the `Age` column.

In [24]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)

df_cleaned = df.dropna(subset=["Age"])

print(df_cleaned.head())

   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  3101298  12.2875   NaN        S  
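
Dropping rows is only one way to handle missing ages. As a hedged alternative, the sketch below compares `dropna` with filling the gaps using the column median; it runs on a tiny made-up DataFrame, and whether imputation is appropriate for the real data is a judgment call this exercise does not settle:

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Age": [20.0, np.nan, 40.0, np.nan],
})

# Option 1: remove rows where Age is missing
dropped = toy.dropna(subset=["Age"])

# Option 2: keep every row and fill missing ages with the median
median_age = toy["Age"].median()  # NaN values are skipped by default
filled = toy.assign(Age=toy["Age"].fillna(median_age))

print(f"original: {len(toy)} rows, dropped: {len(dropped)} rows, filled: {len(filled)} rows")
```

Both calls return new DataFrames, so `toy` itself still contains its missing values afterwards.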


7. Remove duplicate rows from the dataset.

In [26]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)

df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates.head())

   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  3101298  12.2875   NaN        S  
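
Printing `head()` does not reveal whether any rows were actually removed. A small sketch on made-up data (not the Titanic file) shows how duplicates could be counted first, and how the `subset` parameter narrows the comparison to selected columns:

```python
import pandas as pd

# Toy data with one fully repeated row (illustrative values only)
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["Ann", "Ben", "Ben", "Ann"],
})

# Rows that repeat an earlier row in every column
n_duplicates = toy.duplicated().sum()

# Drop fully identical rows
deduped = toy.drop_duplicates()

# Drop rows that repeat an earlier value in the Name column only
deduped_by_name = toy.drop_duplicates(subset=["Name"], keep="first")

print(n_duplicates, len(deduped), len(deduped_by_name))
```

Comparing `len(df)` with `len(df_no_duplicates)` gives the same check on the real dataset.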


8. Compute and display the correlation matrix of the dataset.

In [28]:
import pandas as pd


file_path = "dataset/titanic_dataset.csv"  
df = pd.read_csv(file_path)


correlation_matrix = df.corr()

print(correlation_matrix)

ValueError: could not convert string to float: 'Kelly, Mr. James'

`df.corr()` fails here for two reasons:

1. Correlation is only defined for numerical columns such as integers and floats.

2. Since pandas 2.0, `df.corr()` defaults to `numeric_only=False`, so pandas tries to convert every column to numbers. When it reaches the string "Kelly, Mr. James" in the `Name` column, it raises a `ValueError`.
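
One way to get a correlation matrix despite the text columns is to restrict the computation to numeric columns. The sketch below uses a three-row toy DataFrame built from values shown earlier in this notebook; on the full dataset the same call would be `df.corr(numeric_only=True)`:

```python
import pandas as pd

# Three rows copied from the head() output above, reduced to three columns
toy = pd.DataFrame({
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James (Ellen Needs)", "Myles, Mr. Thomas Francis"],
    "Age": [34.5, 47.0, 62.0],
    "Fare": [7.8292, 7.0000, 9.6875],
})

# numeric_only=True skips the string column instead of trying to convert it
correlation_matrix = toy.corr(numeric_only=True)
print(correlation_matrix)

# An equivalent approach: select the numeric columns explicitly first
same = toy.select_dtypes(include="number").corr()
```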

# Part 3: Working with Case Studies

When working on these case studies, **always ensure that your code is properly documented and clearly presented**. Follow these key principles:  

### **1. Always Show Your Code**  
- Every step of data exploration, cleaning, and analysis should include the **code and its visible output**.  
- Do not skip showing your process, as transparency is essential for reproducibility.  

### **2. Proper Documentation is Necessary**  
- Use **comments (`#`) in Python** to explain your code clearly.  
- Add **Markdown cells** to describe each step before executing the code.  
- Explain key findings in simple language to make the analysis easy to understand.  

### **3. Use Readable and Organized Code**  
- Follow a **step-by-step approach** to keep the notebook structured.  
- Use **proper variable names** and avoid hardcoding values where possible.

# **Case Study 1: Iris Flower Classification** 🌸  

### **Background**  
A botanical research institute wants to develop an automated system that classifies different species of **iris flowers** based on their **sepal and petal measurements**.  The dataset consists of **150 samples**, labeled as **Setosa, Versicolor, or Virginica**.  

### **Problem Statement**  
Can we use **sepal and petal dimensions** to correctly classify the **species of an iris flower**?  

### **Task Description**  

#### **1. Data Exploration**  
- Load the dataset and display the first few rows.  
- Identify any missing or inconsistent values.  

#### **2. Data Cleaning**  
- Check for missing values and handle them appropriately.  
- Convert categorical species labels into a format suitable for analysis.  
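
As a rough sketch of the cleaning step, the example below checks for missing values and converts the species labels to integer codes. The column names `sepal_length` and `species` and all measurements are assumptions made for illustration; the actual dataset file may name its columns differently:

```python
import pandas as pd

# Made-up sample rows; column names are hypothetical
iris_toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 4.9],
    "species": ["Setosa", "Versicolor", "Virginica", "Setosa"],
})

# Count missing values per column before deciding how to handle them
print(iris_toy.isnull().sum())

# Store the labels as a categorical type, then derive integer codes
iris_toy["species"] = iris_toy["species"].astype("category")
iris_toy["species_code"] = iris_toy["species"].cat.codes

print(iris_toy)
```

Category codes follow the sorted order of the labels, so the mapping is stable as long as the set of species does not change.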

#### **3. Basic Data Analysis**  
- Find the average sepal and petal dimensions for each species.  
- Identify correlations between different flower measurements.  
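
The two analysis steps could be approached with `groupby` and `corr`. This is a minimal sketch on made-up measurements with hypothetical column names, not results from the real dataset:

```python
import pandas as pd

# Made-up sample rows; column names are hypothetical
iris_toy = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 6.0, 6.4],
    "petal_length": [1.4, 1.6, 4.0, 4.4],
    "species": ["Setosa", "Setosa", "Versicolor", "Versicolor"],
})

# Average of each numeric measurement per species
species_means = iris_toy.groupby("species").mean(numeric_only=True)
print(species_means)

# Pairwise correlations between the numeric measurements
correlations = iris_toy.corr(numeric_only=True)
print(correlations)
```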

#### **4. Visualization**  
- Create simple visualizations (e.g., histograms, scatter plots) to understand data distribution.  

#### **5. Insights & Interpretation**  
- Summarize key findings, such as which features best distinguish flower species.  

# **Case Study 2: Netflix Content Analysis** 🎬  

## **Background**  
Netflix is a leading streaming platform with a vast collection of movies and TV shows. The company wants to analyze its **content library** to understand trends in **genres, release years, and regional distribution**.  

## **Problem Statement**  
How can we use **Netflix’s dataset** to gain insights into content distribution, popular genres, and release trends over time?  

## **Task Description**  

### **1. Data Exploration**  
- Load the dataset and inspect its structure.  
- Identify key columns such as title, genre, release year, and country.  

### **2. Data Cleaning**  
- Check for missing or incorrect values in key columns.  
- Remove duplicates and format the date-related data properly.  
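
A possible shape for the cleaning step is sketched below. The column names `title` and `date_added`, the date format, and the sample rows are all assumptions for illustration; they should be checked against the actual file before reuse:

```python
import pandas as pd

# Made-up sample rows; column names and date format are hypothetical
netflix_toy = pd.DataFrame({
    "title": ["Show A", "Show A", "Movie B", "Movie C"],
    "date_added": ["September 25, 2021", "September 25, 2021", "not a date", None],
})

# Remove fully identical rows
netflix_toy = netflix_toy.drop_duplicates().copy()

# Parse dates; values that do not match the format become NaT instead of raising
netflix_toy["date_added"] = pd.to_datetime(
    netflix_toy["date_added"], format="%B %d, %Y", errors="coerce"
)

print(netflix_toy)
print("Unparseable or missing dates:", netflix_toy["date_added"].isna().sum())
```

Using `errors="coerce"` keeps the load from failing on a bad value, at the cost of having to count and review the resulting `NaT` entries afterwards.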

### **3. Basic Data Analysis**  
- Count the number of movies vs. TV shows.  
- Identify the most common genres and countries producing content.  
- Analyze the number of releases per year to observe trends.  
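
The three counts above map naturally onto `value_counts`. The sketch below assumes hypothetical columns `type`, `genres` (comma-separated), and `release_year`, with made-up rows; the real dataset may store these differently:

```python
import pandas as pd

# Made-up sample rows; column names are hypothetical
netflix_toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "genres": ["Drama, Comedy", "Drama", "Comedy", "Action, Drama"],
    "release_year": [2019, 2020, 2020, 2020],
})

# Movies vs. TV shows
type_counts = netflix_toy["type"].value_counts()

# Split the comma-separated genres, put one genre per row, then count
genre_counts = netflix_toy["genres"].str.split(", ").explode().value_counts()

# Releases per year, ordered chronologically
releases_per_year = netflix_toy["release_year"].value_counts().sort_index()

print(type_counts, genre_counts, releases_per_year, sep="\n\n")
```

The same split-and-explode pattern would apply to a multi-valued country column.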

### **4. Insights & Interpretation**  
- Summarize key findings, such as trends in Netflix's content production over time.  
