## Why Not Plain Python?
Standard Python lists and dictionaries are great for general-purpose programming, but they have two major limitations for data science:
* Performance: Python is an interpreted language, so mathematical operations on large lists using `for` loops pay interpreter overhead on every element and are slow compared with compiled code.
* Functionality: Handling tabular data, managing missing values, and performing complex statistical operations are cumbersome with basic Python data structures.

`NumPy` and `Pandas` solve these problems.

`NumPy` (Numerical Python): The foundation of the scientific Python ecosystem. Its core is implemented largely in C, which is what makes its data manipulation fast.

`Pandas`: Built on top of `NumPy`, it provides powerful data structures (like the `DataFrame`) designed to make working with "relational" or "labeled" data easy and intuitive.

## Part 1: NumPy
The core of NumPy is the n-dimensional array, or `ndarray`. It's a grid of values, all of the same data type, and that uniformity is a key reason for its performance.
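
To make the "same data type" point concrete, here is a minimal sketch. The variable names `ints` and `mixed` are only illustrative. It shows that NumPy picks one dtype for the whole array and converts elements to fit it.

```python
import numpy as np

# Every element of an ndarray shares one dtype.
ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])  # the integers are upcast to floats

print(ints.dtype)   # an integer dtype (exact width depends on the platform)
print(mixed.dtype)  # float64
print(ints.shape, ints.ndim)  # (3,) 1
```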

Imagine you have a list of a million numbers and you want to add 5 to each one.

In plain Python, you'd use a loop:

In [1]:
# The slow way with a Python list
my_list = list(range(1000000))
for i in range(len(my_list)):
    my_list[i] += 5

This is slow because Python must check the type of each element and perform the operation one by one.

With NumPy, you perform a "vectorized" operation:

In [3]:
import numpy as np

# The fast, vectorized way with a NumPy array
my_array = np.arange(1000000)
my_array = my_array + 5 # Add 5 to every element at once

This is orders of magnitude faster because the operation my_array + 5 is executed in pre-compiled C code, completely bypassing the Python interpreter's loop overhead. This concept of replacing loops with array expressions is called vectorization, and it is the cornerstone of NumPy.
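
If you want to check the speed difference on your own machine, a rough timing sketch with the standard-library `timeit` module could look like this. The function names are hypothetical, and the exact numbers will vary with hardware and NumPy version.

```python
import timeit

import numpy as np

n = 1_000_000


def add_five_loop():
    # Plain Python: one interpreted iteration per element
    values = list(range(n))
    for i in range(len(values)):
        values[i] += 5
    return values


def add_five_vectorized():
    # NumPy: a single array expression, executed in compiled code
    values = np.arange(n)
    return values + 5


loop_seconds = timeit.timeit(add_five_loop, number=3)
vectorized_seconds = timeit.timeit(add_five_vectorized, number=3)

print(f"Loop:       {loop_seconds:.3f} s for 3 runs")
print(f"Vectorized: {vectorized_seconds:.3f} s for 3 runs")
```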

### 1. Creating Arrays
You can create arrays from Python lists or use built-in NumPy functions.

As you can see, this is very similar to what we do in vanilla Python.

In [8]:
import numpy as np

# From a Python list
a = np.array([1, 2, 3, 4, 5])
print(f"1D Array: {a}")

# A 2D array (a matrix)
b = np.array([[1, 2, 3], [4, 5, 6]])
print(f"2D Array:\n{b}")

# Create an array of zeros
c = np.zeros((2, 3)) # A 2x3 matrix of zeros
print(f"Zeros:\n{c}")

# Create an array with a range of elements
d = np.arange(0, 10, 2) # Start, stop (exclusive), step
print(f"Range Array: {d}")

1D Array: [1 2 3 4 5]
2D Array:
[[1 2 3]
 [4 5 6]]
Zeros:
[[0. 0. 0.]
 [0. 0. 0.]]
Range Array: [0 2 4 6 8]


### 2. Basic Operations & Slicing
You can perform element-wise operations and select subsets of data just like with lists, but with more power.

In [15]:
data = np.array([[5, 10, 15, 20, 25], [5, 10, 15, 20, 25]])

# Basic math (vectorized!)
print(f"Data * 2 = \n{data * 2}")
print(f"Data - 5 = \n{data - 5}")
print(f"np.sqrt(data) = \n{np.sqrt(data)}")

# Slicing (a 2D array takes one slice per axis: rows first, then columns)
print(f"Columns 1 to 3 of every row: \n{data[:, 1:4]}")

# Boolean indexing, also called conditional selection (very powerful, very useful)
large_values = data > 15
print(f"Boolean mask: \n{large_values}")
print(f"Select large values: \n{data[large_values]}")

Data * 2 = 
[[10 20 30 40 50]
 [10 20 30 40 50]]
Data - 5 = 
[[ 0  5 10 15 20]
 [ 0  5 10 15 20]]
np.sqrt(data) = 
[[2.23606798 3.16227766 3.87298335 4.47213595 5.        ]
 [2.23606798 3.16227766 3.87298335 4.47213595 5.        ]]
Columns 1 to 3 of every row: 
[[10 15 20]
 [10 15 20]]
Boolean mask: 
[[False False False  True  True]
 [False False False  True  True]]
Select large values: 
[20 25 20 25]
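
Two details of 2D indexing are easy to miss: a single index or slice applies to rows, and a boolean mask returns a flattened 1D result. The sketch below uses illustrative names (`grid`, `capped`) and shows one index per axis, plus `np.where` as one way to keep the original shape.

```python
import numpy as np

grid = np.array([[5, 10, 15, 20, 25],
                 [5, 10, 15, 20, 25]])

first_row = grid[0]         # one index selects a whole row
middle_cols = grid[:, 1:4]  # all rows, columns 1 to 3
single = grid[1, 2]         # row 1, column 2

# A boolean mask flattens the selection to 1D...
selected = grid[grid > 15]

# ...while np.where keeps the 2D shape, here capping values at 15
capped = np.where(grid > 15, 15, grid)

print(first_row)
print(middle_cols)
print(single)
print(selected.shape, capped.shape)
```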


## Part 2: Pandas

While `NumPy` is great for raw numbers, real-world data has context. We have rows representing samples (e.g., a customer) and columns representing features (e.g., age, city, purchase amount). `Pandas` is designed to handle exactly this kind of tabular data.

`Pandas` introduces two primary data structures: the `Series` and the `DataFrame`.

`Series`: A one-dimensional labeled array. Think of it as a single column from a spreadsheet.

`DataFrame`: A two-dimensional labeled structure with columns of potentially different types. This is your entire "spreadsheet".

A NumPy 2D array is just a grid of numbers. A Pandas DataFrame has:

* Column labels (e.g., 'Name', 'Age').
* Row labels, called an Index, which provides a powerful way to reference data.
* The ability to store different data types (text, numbers, dates) in different columns.
* A massive library of functions to handle missing data, group, merge, and clean data.
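
The points above can be sketched with a tiny example. The names, ages, and cities are made up purely for illustration. It shows a `Series`, a `DataFrame` with mixed column types, and label-based lookup through the Index with `.loc`.

```python
import pandas as pd

labels = ["alice", "bob", "carol"]  # hypothetical row labels

# A Series: one labeled column of values
ages = pd.Series([34, 28, 45], index=labels, name="Age")

# A DataFrame: several columns, each with its own data type
people = pd.DataFrame(
    {"Age": [34, 28, 45], "City": ["Lima", "Oslo", "Pune"]},
    index=labels,
)

print(ages["bob"])                # look up a Series value by label
print(people.loc["bob", "City"])  # look up by row label and column label
print(people.dtypes)              # one dtype per column
```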

### 1. Creating a DataFrame
The most common way is from a dictionary or by reading a CSV file. You can also convert `NumPy` arrays into a `DataFrame`, but that's for you to explore.
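
As a rough sketch of the other two routes, the block below reads CSV text with `pd.read_csv` and wraps a NumPy array in a `DataFrame`. To stay self-contained it reads from an in-memory string via `io.StringIO`. With a real file you would pass its path instead. The column names `a`, `b`, `c` are arbitrary.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = """Country,Population,Continent
USA,331,North America
Japan,126,Asia
"""

df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)

# A NumPy array has no labels, so we supply column names ourselves
matrix = np.array([[1, 2, 3], [4, 5, 6]])
df_from_array = pd.DataFrame(matrix, columns=["a", "b", "c"])
print(df_from_array)
```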

In [16]:
import pandas as pd

# Creating a DataFrame from a dictionary
data_dict = {
    'Country': ['USA', 'Canada', 'Australia', 'Japan'],
    'Population': [331, 38, 26, 126],
    'Continent': ['North America', 'North America', 'Oceania', 'Asia']
}

df = pd.DataFrame(data_dict)
print(df)

     Country  Population      Continent
0        USA         331  North America
1     Canada          38  North America
2  Australia          26        Oceania
3      Japan         126           Asia


### 2. How to Inspect Data

In [18]:
# See the first 2 rows
print("--- First 2 rows ---")
print(df.head(2)) # head() returns the first 5 rows by default; here we ask for only 2

# Get a concise summary of the DataFrame
print("\n--- Info ---")
df.info()

# Get descriptive statistics for numerical columns
print("\n--- Describe ---")
print(df.describe())

--- First 2 rows ---
  Country  Population      Continent
0     USA         331  North America
1  Canada          38  North America

--- Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Country     4 non-null      object
 1   Population  4 non-null      int64 
 2   Continent   4 non-null      object
dtypes: int64(1), object(2)
memory usage: 228.0+ bytes

--- Describe ---
       Population
count     4.00000
mean    130.25000
std     141.06352
min      26.00000
25%      35.00000
50%      82.00000
75%     177.25000
max     331.00000
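
A few other attributes are often handy for a quick look. This sketch rebuilds the same small DataFrame so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "Canada", "Australia", "Japan"],
    "Population": [331, 38, 26, 126],
    "Continent": ["North America", "North America", "Oceania", "Asia"],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # the column labels
print(df.dtypes)         # the data type of each column

# describe() skips text columns by default; value_counts() summarizes them
continent_counts = df["Continent"].value_counts()
print(continent_counts)
```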


### 3. Selecting Data

In [19]:
# Select a single column (returns a Series)
countries = df['Country']
print(f"\n--- Country Column ---\n{countries}")

# Select multiple columns (returns a DataFrame)
subset = df[['Country', 'Population']]
print(f"\n--- Subset of Columns ---\n{subset}")

# --- Filtering Rows (Conditional Selection) ---
# Select rows based on a condition
high_population = df[df['Population'] > 100]
print(f"\n--- High Population Countries ---\n{high_population}")

# Filter on a text column with == (conditions can also be combined with & and |, each wrapped in parentheses)
north_america_df = df[df['Continent'] == 'North America']
print(f"\n--- North American Countries ---\n{north_america_df}")


--- Country Column ---
0          USA
1       Canada
2    Australia
3        Japan
Name: Country, dtype: object

--- Subset of Columns ---
     Country  Population
0        USA         331
1     Canada          38
2  Australia          26
3      Japan         126

--- High Population Countries ---
  Country  Population      Continent
0     USA         331  North America
3   Japan         126           Asia

--- North American Countries ---
  Country  Population      Continent
0     USA         331  North America
1  Canada          38  North America
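
Combining conditions could look like the sketch below, which rebuilds the same DataFrame to stay self-contained. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons like `>` and `==`.

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "Canada", "Australia", "Japan"],
    "Population": [331, 38, 26, 126],
    "Continent": ["North America", "North America", "Oceania", "Asia"],
})

# AND: both conditions must hold
big_north_american = df[(df["Continent"] == "North America") & (df["Population"] > 100)]
print(big_north_american)

# OR: at least one condition must hold
very_big_or_asian = df[(df["Population"] > 300) | (df["Continent"] == "Asia")]
print(very_big_or_asian)
```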


### 4. Creating New Columns
You can easily create new columns, often derived from existing ones. This is where the power of NumPy's vectorization comes in handy within Pandas.

In [22]:
# Create a new column 'Population in Billions'
# ('Population' is in millions, so dividing by 1000 gives billions)
# The calculation is vectorized: it applies to the whole column at once
df['Population in Billions'] = df['Population'] / 1000

print(df)

     Country  Population      Continent  Population in Billions
0        USA         331  North America                   0.331
1     Canada          38  North America                   0.038
2  Australia          26        Oceania                   0.026
3      Japan         126           Asia                   0.126
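
NumPy functions can also be applied directly to a column. As a small sketch, `np.where` below builds a text column from a condition. The `Size` column name and the 100-million threshold are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Country": ["USA", "Canada", "Australia", "Japan"],
    "Population": [331, 38, 26, 126],  # in millions
})

# Arithmetic on a column is vectorized
df["Population in Billions"] = df["Population"] / 1000

# np.where(condition, value_if_true, value_if_false), evaluated element-wise
df["Size"] = np.where(df["Population"] > 100, "large", "small")

print(df)
```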


## Conclusion: How They Work Together
You'll use Pandas to load, clean, label, and structure your data in a DataFrame. It's your main tool for exploring and manipulating datasets.

You'll use NumPy's functions and vectorized operations on the data stored within the Pandas DataFrame columns to perform fast numerical computations.

A typical workflow is:
1. Load data into a Pandas `DataFrame` using `pd.read_csv()`.
2. Use Pandas functions like `.head()`, `.info()`, `.describe()` to explore it.
3. Use Pandas filtering (`df[df['column'] > 5]`) to select subsets of data.
4. Apply a fast NumPy function (`np.log()`) to a column to transform it.
5. Store the result in a new column in your DataFrame.
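
The five steps above can be sketched end to end as follows. This is only an illustration: the CSV content is an in-memory stand-in for a real file, and the threshold of 30 and the `Log Population` column name are arbitrary.

```python
import io

import numpy as np
import pandas as pd

# Stand-in for a file on disk; with a real file, pass its path to pd.read_csv
csv_text = """Country,Population,Continent
USA,331,North America
Canada,38,North America
Australia,26,Oceania
Japan,126,Asia
"""

# 1. Load the data
df = pd.read_csv(io.StringIO(csv_text))

# 2. Explore it
print(df.head())
df.info()
print(df.describe())

# 3. Filter rows (copy so that adding a column does not touch the original)
large = df[df["Population"] > 30].copy()

# 4 and 5. Apply a NumPy function and store the result in a new column
large["Log Population"] = np.log(large["Population"])
print(large)
```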