# **Module 4: Python for Data Analysis**

Most data analysis in Python relies on two libraries: **NumPy** and **Pandas**. Together, they cover everything from small CSV files to query results pulled from databases, as long as the data fits in memory.

We start with the basics and build up step by step, so the material works for **both beginners and advanced learners**.

---

## 🟢 Part 1: Working with NumPy and Pandas

### 🔹 What is NumPy?

* NumPy stands for **Numerical Python**.
* It is the foundation of data science in Python because it allows you to work with **arrays** (grids of numbers).
* Unlike normal Python lists, NumPy arrays are:

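  * **Faster** (element-wise operations run in compiled C code instead of the Python interpreter).
  * **More memory efficient** (all elements share one fixed data type and sit in a contiguous block of memory).
  * **Vectorized** (you can apply math to entire arrays at once).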

#### Example: Python list vs NumPy array

In [1]:
import numpy as np

In [2]:
# Python list
numbers_list = [1, 2, 3, 4, 5]


In [3]:
# NumPy array
numbers_array = np.array([1, 2, 3, 4, 5])

In [4]:
print(numbers_list * 2)     # repeats the list → [1,2,3,4,5,1,2,3,4,5]
print(numbers_array * 2)    # multiplies each element → [2 4 6 8 10]

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
[ 2  4  6  8 10]



👉 With NumPy, multiplying the array by 2 applies the operation to **every element** without an explicit loop. That’s called **vectorization**, and it’s a major reason NumPy is so efficient.
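
To make the idea of vectorization more concrete, here is a small sketch that compares an explicit Python loop with the equivalent NumPy expression. The variable names are only illustrative; the point is that both approaches give the same numbers, but the NumPy version needs no loop in your own code.

```python
import numpy as np

numbers_list = [1, 2, 3, 4, 5]
numbers_array = np.array(numbers_list)

# Pure-Python approach: loop over the elements (here as a list comprehension)
doubled_list = [n * 2 for n in numbers_list]

# Vectorized approach: one expression, the loop runs inside NumPy
doubled_array = numbers_array * 2

# Vectorization also works for longer expressions and for comparisons
squared_plus_one = numbers_array ** 2 + 1
is_even = numbers_array % 2 == 0

print(doubled_list)               # [2, 4, 6, 8, 10]
print(doubled_array.tolist())     # [2, 4, 6, 8, 10]
print(squared_plus_one.tolist())  # [2, 5, 10, 17, 26]
print(is_even.tolist())           # [False, True, False, True, False]
```

On arrays this small the speed difference is negligible; it tends to matter once arrays hold thousands or millions of values.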

#### Common NumPy uses

* Creating arrays (`np.array`, `np.arange`, `np.linspace`).
* Mathematical operations (`np.mean`, `np.std`, `np.sum`).
* Linear algebra (matrix multiplication, eigenvalues).
* Random number generation (`np.random`).
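
The uses listed above can be sketched in a few lines. This is a minimal illustration, not a complete tour: the matrices and the seed value are arbitrary choices made for the example.

```python
import numpy as np

# Creating arrays
evens = np.arange(0, 10, 2)       # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced points, both endpoints included

print(evens.tolist())  # [0, 2, 4, 6, 8]
print(grid.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]

# Linear algebra: matrix multiplication and eigenvalues
a = np.array([[2.0, 0.0], [0.0, 3.0]])
b = np.array([[1.0, 2.0], [3.0, 4.0]])
product = a @ b                     # matrix product, not element-wise
eigenvalues = np.linalg.eigvals(a)  # 2.0 and 3.0 for this diagonal matrix

print(product.tolist())  # [[2.0, 4.0], [9.0, 12.0]]

# Random numbers: a seeded generator gives reproducible draws
rng = np.random.default_rng(seed=42)
samples = rng.normal(loc=0.0, scale=1.0, size=3)
print(samples.shape)  # (3,)
```

Passing a seed to `np.random.default_rng` is optional; it is used here so that repeated runs draw the same values.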

Example:

In [5]:
arr = np.array([10, 20, 30, 40, 50])
print("Mean:", np.mean(arr))   # 30.0
print("Standard deviation:", np.std(arr))  # 14.14

Mean: 30.0
Standard deviation: 14.142135623730951
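
One detail worth knowing about the result above: `np.std` computes the **population** standard deviation by default (`ddof=0`), while Pandas' `Series.std` computes the **sample** standard deviation (`ddof=1`). The short sketch below shows the difference on the same five numbers.

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30, 40, 50])

population_std = np.std(arr)       # divides by n (ddof=0, NumPy's default)
sample_std = np.std(arr, ddof=1)   # divides by n - 1
pandas_std = pd.Series(arr).std()  # Pandas defaults to ddof=1

print(round(population_std, 2))  # 14.14
print(round(sample_std, 2))      # 15.81
print(round(pandas_std, 2))      # 15.81
```

If NumPy and Pandas ever seem to disagree on a standard deviation, the `ddof` setting is usually the first thing to check.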


---

### 🔹 What is Pandas?

* Pandas is built on top of NumPy and gives you **tabular data structures**.
* Instead of just arrays, Pandas provides:

  * **Series** → 1D labeled array (like one Excel column).
  * **DataFrame** → 2D table (like an Excel sheet or SQL table).
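
As a quick illustration of the two structures, the sketch below builds a Series with labels and then shows that each column of a DataFrame is itself a Series. The names and values are sample data chosen for the example.

```python
import pandas as pd

# A Series: one column of values, each with a label (the index)
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="Age")
print(ages["Bob"])  # 30

# A DataFrame: several Series that share the same index
cities = pd.Series(["London", "Bristol", "Manchester"], index=ages.index, name="City")
people = pd.DataFrame({"Age": ages, "City": cities})

print(people.shape)         # (3, 2)
print(type(people["Age"]))  # <class 'pandas.core.series.Series'>
```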

Think of the Pandas DataFrame as your **main tool for data analysis in Python**.

#### Example: Creating a DataFrame


In [6]:
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["London", "Bristol", "Manchester"]
}

df = pd.DataFrame(data)
print(df)

      Name  Age        City
0    Alice   25      London
1      Bob   30     Bristol
2  Charlie   35  Manchester



* Each **row** = one record (like one customer).
* Each **column** = one variable (like Age or City).
* You can easily select, filter, and transform data.
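
Here is a minimal sketch of selecting, filtering, and transforming, using the same three-person table. The new column name `AgeInMonths` is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Age": [25, 30, 35],
    "City": ["London", "Bristol", "Manchester"],
})

# Select: pick one column (returns a Series)
names = df["Name"]

# Filter: keep only the rows where a condition is true
over_28 = df[df["Age"] > 28]

# Transform: derive a new column from an existing one
df["AgeInMonths"] = df["Age"] * 12

# Combine filter and select: the city of one specific person
bob_city = df.loc[df["Name"] == "Bob", "City"].iloc[0]

print(over_28["Name"].tolist())    # ['Bob', 'Charlie']
print(df["AgeInMonths"].tolist())  # [300, 360, 420]
print(bob_city)                    # Bristol
```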

👉 Pandas lets you do **in Python much of what SQL does in databases**: filtering rows, grouping and aggregating, and joining tables.
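
To show the parallel with SQL, the sketch below pairs three common SQL operations with a Pandas counterpart. The `customers` and `orders` tables are hypothetical sample data invented for this example.

```python
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "city": ["London", "Bristol", "London"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100, 50, 200, 70],
})

# SQL: SELECT * FROM orders WHERE amount > 60
big_orders = orders[orders["amount"] > 60]

# SQL: SELECT ... FROM orders JOIN customers USING (customer_id)
joined = orders.merge(customers, on="customer_id", how="inner")

# SQL: SELECT city, SUM(amount) FROM ... GROUP BY city
totals = joined.groupby("city")["amount"].sum()

print(len(big_orders))    # 3
print(totals["London"])   # 220
print(totals["Bristol"])  # 200
```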

---

### 🔹 Why use both NumPy and Pandas?

* **NumPy**: For raw **mathematical computation** (fast, efficient arrays).
* **Pandas**: For working with **structured datasets** (tables with rows and columns).

They are usually used **together**: Pandas stores data in DataFrames, and by default most columns are backed by NumPy arrays, which is where much of the speed comes from.
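
The sketch below shows the two libraries working together: a DataFrame column is converted to a NumPy array, and a NumPy function is applied directly to a Pandas column. The data is the small sample table from earlier.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# Pull the underlying values out of a column as a NumPy array
ages = df["Age"].to_numpy()
print(type(ages).__name__)  # ndarray
print(np.mean(ages))        # 30.0

# NumPy functions also accept a Pandas column directly;
# the result comes back as a Series with the same index
log_ages = np.log(df["Age"])
print(type(log_ages).__name__)  # Series
```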

---

## 🟡 Part 2: Importing Data (CSV, Excel, SQL)

Now that you know how to create data manually, let’s see how to **import real-world datasets**. Most of your work as a data analyst will start with *loading data into Pandas*.

---

### 🔹 Importing CSV Files

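CSV = **Comma Separated Values**, one of the most common plain-text formats for tabular data.

The first cell below writes a small sample file, `customers.csv`, so that the second cell has something to load into Pandas: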



In [7]:
import pandas as pd

data = {
    "customer_id": [101, 102, 103, 104, 105, 106, 107, 108, 109, 110],
    "first_name": ["John", "Mary", "Alex", "John", "Sophia", "Liam", "Olivia", "Noah", "Emma", "James"],
    "last_name": ["Smith", "Jones", "Brown", "Smith", "Wilson", "Taylor", "Johnson", "Williams", "Brown", "Davis"],
    "age": [35, None, 42, 35, 29, None, 31, 27, 29, 45],
    "city": ["London", "Bristol", None, "London", "Manchester", "Liverpool", "Leeds", "London", "Cardiff", "Bristol"],
    "email": [
        "john@gmail.com",
        "mary@yahoo.com",
        "alex@gmail.com",
        "john@gmail.com",
        "sophia.wilson@gmail.com",
        "liam_taylor@hotmail.com",
        "olivia.j@outlook.com",
        "noah.williams@gmail",
        "emma_brown@@gmail.com",
        "james.davis@yahoo.com"
    ]
}

df = pd.DataFrame(data)
df.to_csv("customers.csv", index=False)
print("✅ customers.csv created successfully!")

✅ customers.csv created successfully!


In [8]:
df = pd.read_csv("customers.csv")
print(df.head())   # shows first 5 rows

   customer_id first_name last_name   age        city                    email
0          101       John     Smith  35.0      London           john@gmail.com
1          102       Mary     Jones   NaN     Bristol           mary@yahoo.com
2          103       Alex     Brown  42.0         NaN           alex@gmail.com
3          104       John     Smith  35.0      London           john@gmail.com
4          105     Sophia    Wilson  29.0  Manchester  sophia.wilson@gmail.com


👉 Once in Pandas, you can filter, group, and analyze this data.
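
Real CSV files are often messier than the sample above, so `pd.read_csv` takes options for common cases. The sketch below is a minimal illustration that reads from an in-memory string instead of a file; the semicolon separator, the `"unknown"` placeholder, and the `signup_date` column are assumptions made up for this example.

```python
import io
import pandas as pd

# Stand-in for a file on disk: semicolon-separated, with a text placeholder for missing ages
csv_text = """customer_id;first_name;age;signup_date
101;John;35;2024-01-12
102;Mary;unknown;2024-02-15
103;Alex;42;2024-03-05
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    sep=";",                      # the delimiter used in this file
    na_values=["unknown"],        # treat this text as a missing value
    parse_dates=["signup_date"],  # convert this column to datetimes
    usecols=["customer_id", "age", "signup_date"],  # load only these columns
)

print(df.columns.tolist())     # ['customer_id', 'age', 'signup_date']
print(df["age"].isna().sum())  # 1
print(df.dtypes)
```

With a real file you would pass the path (for example `"customers.csv"`) instead of the `io.StringIO` object.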

---

### 🔹 Importing Excel Files

Excel is another very common format. Pandas supports reading directly from `.xlsx` files.

Example:

In [11]:
import pandas as pd

df = pd.read_excel("sales.xlsx", sheet_name="2024")
print(df.head())


   order_id  customer_id order_date product  quantity  price
0      2001          101 2024-01-12  Laptop         1   1200
1      2002          102 2024-02-15  Tablet         2    600
2      2003          101 2024-03-05   Phone         1    800
3      2004          103 2024-03-10  Laptop         1   1500
4      2005          104 2024-04-01   Phone         3    750



* `sheet_name="2024"` tells Pandas which worksheet to load. If you leave it out, Pandas loads the first sheet.
* Reading `.xlsx` files requires the `openpyxl` package (`pip install openpyxl`).

---

### 🔹 Importing from SQL Databases

Sometimes data isn’t stored in files but in a **database** (MySQL, PostgreSQL, SQLite, etc.). You can query the database using SQL and directly load the results into a Pandas DataFrame.

Example with SQLite (this assumes a `company.db` file containing an `employees` table already exists):



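import sqlite3
import pandas as pd

# Open a connection to the SQLite database file
conn = sqlite3.connect("company.db")
try:
    # Run the query and load the result set into a DataFrame
    df = pd.read_sql("SELECT * FROM employees", conn)
    print(df)
finally:
    # Always release the connection, even if the query fails
    conn.close()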



👉 This combines **SQL** (to fetch data) and **Pandas** (to analyze it).
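
If you don't have a database file at hand, the following self-contained sketch may help. It builds a throwaway in-memory SQLite database, fills a hypothetical `employees` table with made-up rows, and loads a filtered query into Pandas. The column names and values are assumptions for illustration only.

```python
import sqlite3
from contextlib import closing

import pandas as pd

# ":memory:" creates a temporary database that disappears when the connection closes
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
    )
    conn.executemany(
        "INSERT INTO employees (name, department, salary) VALUES (?, ?, ?)",
        [
            ("Alice", "Sales", 52000.0),
            ("Bob", "Engineering", 68000.0),
            ("Charlie", "Sales", 47000.0),
        ],
    )
    conn.commit()

    # The ? placeholder keeps the filter value out of the SQL string itself
    df = pd.read_sql_query(
        "SELECT name, salary FROM employees WHERE department = ? ORDER BY salary DESC",
        conn,
        params=("Sales",),
    )

print(df["name"].tolist())    # ['Alice', 'Charlie']
print(df["salary"].tolist())  # [52000.0, 47000.0]
```

Passing values through `params` rather than formatting them into the query string is the usual way to avoid SQL injection when the value comes from user input.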

---

## ✅ Recap so far

* **NumPy** = efficient number crunching with arrays.
* **Pandas** = tables (DataFrames) for structured data analysis.
* **Importing Data**: You can load CSV files, Excel sheets, or SQL query results into Pandas and start analyzing.

At this stage, you can:

1. Create a DataFrame.
2. Load real-world datasets into Pandas.
3. Use NumPy functions for calculations.

---

📌 Next in Module 4, we’ll move on to:

* Data cleaning & transformation
* Exploratory Data Analysis (EDA)
* Visualization with Matplotlib & Seaborn
* Time-series analysis
