
### **Week 3: Introduction to Libraries for Data Analysis (NumPy & Pandas)**
**Objective**: Introduce the essential Python libraries, **NumPy** and **Pandas**, that are commonly used for data handling and analysis.



### **1. Introduction to NumPy**
#### **Concept**: What is NumPy?
- **NumPy** is a library for numerical computations in Python. It provides support for working with arrays and includes functions for mathematical operations.

#### **Installing and Importing NumPy**:
1. **Install NumPy**:
   - If needed, install NumPy by running:

In [2]:
# !pip install numpy

2. **Importing NumPy**:

In [3]:
import numpy as np



### **2. Working with NumPy Arrays**
#### **Concept**: Creating and Manipulating Arrays
- **Arrays** are the core data structure in NumPy, similar to lists but optimized for numerical computations.
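A small sketch of how an array differs from a list in practice. The variable names here are illustrative only; the point is that the same `* 2` expression repeats a list but multiplies every element of an array.

```python
import numpy as np

values = [1, 2, 3]          # a plain Python list
arr = np.array(values)      # the same numbers as a NumPy array

doubled_list = values * 2   # list: the elements are repeated
doubled_arr = arr * 2       # array: each element is multiplied by 2

print(doubled_list)   # [1, 2, 3, 1, 2, 3]
print(doubled_arr)    # [2 4 6]
```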

#### **Examples**:
1. **Creating Arrays**:

In [4]:
# Creating a 1D array
array1 = np.array([1, 2, 3, 4, 5])
print(array1)

[1 2 3 4 5]



2. **Array Operations**:
   - Basic operations like addition and multiplication can be performed element-wise:

In [5]:
array2 = np.array([10, 20, 30, 40, 50])

print(array1 + array2)   # Output: [11 22 33 44 55]
print(array1 * 2)        # Output: [ 2  4  6  8 10]

[11 22 33 44 55]
[ 2  4  6  8 10]


3. **Array Statistics**:
   - Students can use NumPy functions to calculate statistics such as mean, median, and standard deviation.

In [6]:
print(np.mean(array1))    # Output: 3.0
print(np.median(array1))  # Output: 3.0
print(np.std(array1))     # Output: 1.4142... 

3.0
3.0
1.4142135623730951
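One detail worth knowing about the result above: by default `np.std` computes the population standard deviation (it divides by `n`). The sketch below, which reuses the same five numbers, shows how the `ddof=1` argument switches to the sample standard deviation (dividing by `n - 1`), which is the convention Pandas uses in the `std` row of `describe()`.

```python
import numpy as np

array1 = np.array([1, 2, 3, 4, 5])

population_std = np.std(array1)        # default ddof=0: divide by n
sample_std = np.std(array1, ddof=1)    # ddof=1: divide by n - 1

print(round(population_std, 4))  # 1.4142
print(round(sample_std, 4))      # 1.5811
```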




#### **Hands-On Exercise**:
- Load a dataset (CSV or Excel) containing information on students (e.g., name, age, scores, and grade level) into Pandas and display the first five rows. Pandas is introduced in Section 3, and loading files is covered in Section 5.


### **3. Introduction to Pandas**
#### **Concept**: What is Pandas?
- **Pandas** is a powerful library for data manipulation and analysis. It provides **DataFrames**, which are tabular data structures (like spreadsheets) with rows and columns.

#### **Installing and Importing Pandas**:
1. **Install Pandas**:

In [8]:
#!pip install pandas


2. **Importing Pandas**:


In [9]:
import pandas as pd

### **4. Creating and Manipulating DataFrames**
#### **Concept**: Introduction to DataFrames
- A **DataFrame** is a 2D structure where data is organized in rows and columns. Each column can have a different data type.
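To see that each column carries its own data type, the `dtypes` attribute can be inspected. This is a minimal sketch; the `Height` column and its values are made up for illustration, and the exact type shown for text columns can vary between Pandas versions.

```python
import pandas as pd

# Illustrative data: one text column, one integer column, one float column
df_types = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [25, 30],
    "Height": [1.65, 1.80],
})

# One data type per column
print(df_types.dtypes)
```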

#### **Examples**:
1. **Creating a DataFrame from a Dictionary**:

In [11]:
data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["Houston", "Los Angeles", "Chicago"]
}

df = pd.DataFrame(data)
print(df)

      Name  Age         City
0    Alice   25      Houston
1      Bob   30  Los Angeles
2  Charlie   35      Chicago


2. **Accessing Columns and Rows**:
   - Accessing columns by their names:

In [12]:
print(df["Name"])    # Outputs the "Name" column

0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object


   - Accessing rows by index:

In [13]:
print(df.iloc[0])    # Outputs the first row

Name      Alice
Age          25
City    Houston
Name: 0, dtype: object
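Note that `iloc` selects by position, while `loc` selects by index label. The two agree on a freshly created DataFrame but can differ once rows have been removed. The following sketch uses a small illustrative DataFrame (`df_rows` is not part of the example above) to show the difference.

```python
import pandas as pd

df_rows = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]}
)

# Drop the row labelled 1; the remaining labels are 0 and 2
subset = df_rows.drop(index=1)

print(subset.iloc[1]["Name"])   # second row by position -> Charlie
print(subset.loc[2, "Name"])    # row with label 2       -> Charlie
```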


3. **Basic DataFrame Operations**:
   - Adding a new column:

In [14]:
df["Salary"] = [70000, 80000, 75000]
print(df)

      Name  Age         City  Salary
0    Alice   25      Houston   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   75000


4. **Data Summary**:
   - Getting a summary of the data:

In [19]:
print(df.describe())   # Output: Statistics of numerical columns
print("--------\n")
df.info()              # Prints info on columns and data types (returns None, so no print() is needed)

        Age   Salary
count   3.0      3.0
mean   30.0  75000.0
std     5.0   5000.0
min    25.0  70000.0
25%    27.5  72500.0
50%    30.0  75000.0
75%    32.5  77500.0
max    35.0  80000.0
--------

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
 3   Salary  3 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 224.0+ bytes




#### **Hands-On Exercise**:
- Create a DataFrame with three columns (e.g., Name, Score, and Grade) and add a new column for "Class".

---

### **5. Loading Data into Pandas**
#### **Concept**: Loading External Data (CSV and Excel)
- **CSV (Comma-Separated Values)** and **Excel** files are common data formats that Pandas can easily import for analysis.

#### **Let's first create a CSV file to work with:**

In [23]:
import csv
import os

# Define data for the CSV file
data = [
    ["Name", "Age", "City", "Salary"],
    ["Alice", 28, "New York", 70000],
    ["Bob", 34, "Los Angeles", 85000],
    ["Charlie", 25, "Chicago", 62000],
    ["Diana", 30, "Houston", 76000],
    ["Eve", 40, "San Francisco", 90000]
]

# Make sure the Data folder exists, then write the data to a CSV file
os.makedirs("Data", exist_ok=True)
with open("Data/sample_data_week03.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerows(data)

print("CSV file 'sample_data_week03.csv' created successfully.")


CSV file 'sample_data_week03.csv' created successfully.



#### **Examples**:
1. **Loading a CSV File**:

In [24]:
df = pd.read_csv("Data/sample_data_week03.csv")
print(df.head())   # Display the first five rows

      Name  Age           City  Salary
0    Alice   28       New York   70000
1      Bob   34    Los Angeles   85000
2  Charlie   25        Chicago   62000
3    Diana   30        Houston   76000
4      Eve   40  San Francisco   90000
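`read_csv` also accepts options that control what gets loaded. As a hedged sketch, the example below reads from an in-memory string (via `io.StringIO`, so no file is required) and uses the `usecols` and `nrows` parameters to load only two columns and the first two data rows. The name `df_part` is illustrative.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file with the same layout as above
csv_text = """Name,Age,City,Salary
Alice,28,New York,70000
Bob,34,Los Angeles,85000
Charlie,25,Chicago,62000
"""

# Load only the Name and Salary columns, and only the first two data rows
df_part = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Name", "Salary"],
    nrows=2,
)
print(df_part)
```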


2. **Loading an Excel File**:
#### **Let's first create an .xlsx file to work with:**

In [26]:
%pip install openpyxl

In [27]:
from openpyxl import Workbook

# Create a workbook and select the active sheet
wb = Workbook()
sheet = wb.active
sheet.title = "Sheet1"

# Define data for the Excel file
data = [
    ["Name", "Age", "City", "Salary"],
    ["Alice", 28, "New York", 70000],
    ["Bob", 34, "Los Angeles", 85000],
    ["Charlie", 25, "Chicago", 62000],
    ["Diana", 30, "Houston", 76000],
    ["Eve", 40, "San Francisco", 90000]
]

# Write data to the sheet
for row in data:
    sheet.append(row)

# Save the workbook
wb.save("Data/sample_data_week03.xlsx")

print("Excel file 'sample_data_week03.xlsx' created successfully.")


Excel file 'sample_data_week03.xlsx' created successfully.


In [28]:
df = pd.read_excel("Data/sample_data_week03.xlsx", sheet_name="Sheet1")
print(df.head())

      Name  Age           City  Salary
0    Alice   28       New York   70000
1      Bob   34    Los Angeles   85000
2  Charlie   25        Chicago   62000
3    Diana   30        Houston   76000
4      Eve   40  San Francisco   90000


#### **Hands-On Exercise**:
- Obtain a sample CSV file (for example, one downloaded from the web), load it into Pandas, and display the first few rows.

### **6. Basic Data Cleaning in Pandas**
#### **Concept**: Data Cleaning Basics (Renaming Columns, Handling Missing Values)

#### **Examples**:
1. **Renaming Columns**:
   - To make column names more readable or standardized:

In [42]:
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, None],  # None represents a missing value
    "city": ["New York", None, "Chicago"]
})

df.rename(columns={"name": "Name", "age": "Age", "city": "City"}, inplace=True)
print(df)

      Name   Age      City
0    Alice  25.0  New York
1      Bob  30.0      None
2  Charlie   NaN   Chicago


2. **Handling Missing Values**:
   - Filling missing values with a specific value:

In [43]:
df.fillna({"Age": 28}, inplace=True)   # Replace missing ages with 28
print(df)

      Name   Age      City
0    Alice  25.0  New York
1      Bob  30.0      None
2  Charlie  28.0   Chicago


   - Dropping rows with missing values:

In [44]:
df.dropna(inplace=True)    # Removes rows with any missing values
print(df)

      Name   Age      City
0    Alice  25.0  New York
2  Charlie  28.0   Chicago
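Before filling or dropping anything, it can help to count how many values are missing in each column. The sketch below rebuilds the same small DataFrame under the illustrative name `df_missing`, counts missing values with `isna().sum()`, and then drops rows only where `Age` is missing by passing `subset` to `dropna`. It returns a new DataFrame instead of using `inplace=True`, so the original is left unchanged.

```python
import pandas as pd

df_missing = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, None],
    "City": ["New York", None, "Chicago"],
})

# Number of missing values per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# Drop rows only where Age is missing; df_missing itself is not modified
cleaned = df_missing.dropna(subset=["Age"])
print(cleaned)
```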


3. **Removing Duplicates**:
   - Removing duplicate entries in the data:

In [45]:
df = df.drop_duplicates()
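The DataFrame above happens to contain no duplicates, so the call has no visible effect. A minimal sketch with made-up data that does contain a repeated row shows what `duplicated()` flags and what `drop_duplicates()` removes.

```python
import pandas as pd

# Illustrative data: the third row repeats the first
df_dup = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice"],
    "Score": [90, 85, 90],
})

# True marks a row that repeats an earlier row
print(df_dup.duplicated().tolist())   # [False, False, True]

deduped = df_dup.drop_duplicates()
print(len(deduped))                   # 2
```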


#### **Hands-On Exercise**:
- Rename columns in a DataFrame, replace missing values in a column with a specified value, and remove duplicate rows.



### **7. Practical Example: Loading and Cleaning a Dataset**

#### **Practical Assignment: Complete the following steps in order**:
1. **Loading the Dataset**:
   - Load a dataset (CSV or Excel) containing information on students (e.g., name, age, scores, and grade level).

In [46]:
df = pd.read_csv("students_data.csv")



2. **Renaming Columns**:
   - Standardize column names.

In [47]:
df.rename(columns={"name": "Name", "age": "Age", "score": "Score"}, inplace=True)

3. **Handling Missing Data**:
   - Fill in missing ages with the average age:

In [49]:
average_age = df["Age"].mean()   # mean() skips missing values by default
df.fillna({"Age": average_age}, inplace=True)

4. **Removing Duplicates**:
   - Remove duplicate student entries if any exist.

In [50]:
df = df.drop_duplicates()

5. **Calculating Basic Statistics**:
   - Calculate and print the average score:

In [52]:
print("Average Score:", df["Score"].mean())
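Since `students_data.csv` is not provided here, the steps above can be tried end to end with an in-memory stand-in. The sketch below is one possible run-through under that assumption: the CSV text, its column names, and its values are hypothetical, and the steps follow the same order as the assignment.

```python
import io
import pandas as pd

# Hypothetical stand-in for students_data.csv
csv_text = """name,age,score
Alice,15,88
Bob,,92
Bob,,92
Charlie,17,75
"""

# 1. Load the dataset
students = pd.read_csv(io.StringIO(csv_text))

# 2. Standardize column names
students = students.rename(
    columns={"name": "Name", "age": "Age", "score": "Score"}
)

# 3. Fill missing ages with the average age (mean() ignores missing values)
average_age = students["Age"].mean()
students = students.fillna({"Age": average_age})

# 4. Remove duplicate rows
students = students.drop_duplicates()

# 5. Calculate the average score
average_score = students["Score"].mean()
print("Average Score:", average_score)   # Average Score: 85.0
```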



#### **Hands-On Exercise**:
- Load a dataset, rename at least two columns, fill in missing values for one column, and drop duplicates. Then, calculate the average of a chosen numerical column.

### **Recap of Week 3**:
- **Key Concepts**: Arrays in NumPy; DataFrames in Pandas; loading, manipulating, and cleaning data.
- **Practice**: By the end of the week, students should be able to load datasets, rename columns, handle missing values, and perform simple calculations.