# 01: NumPy and Pandas Basics

## Overview
This notebook covers the fundamental data structures and operations in NumPy and Pandas, which are essential for data manipulation and analysis in machine learning.

## Topics Covered:
1. NumPy Arrays and Operations
2. Array Indexing and Slicing
3. Broadcasting and Vectorization
4. Pandas DataFrames and Series
5. Data Loading and Basic Operations
6. Data Cleaning and Manipulation
7. Grouping and Aggregation

## Focus Points:
- Understanding of vectorization vs loops
- Memory efficiency
- Data transformation techniques
- Handling missing data

## 1. NumPy Basics

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

NumPy version: 2.3.4
Pandas version: 2.3.3


### 1.1 Creating NumPy Arrays

In [2]:
# Different ways to create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 4))  # 3x4 array of zeros
arr3 = np.ones((2, 3))   # 2x3 array of ones
arr4 = np.arange(0, 10, 2)  # Values from 0 up to (but not including) 10, step 2
arr5 = np.linspace(0, 1, 5)  # 5 evenly spaced values from 0 to 1 (both endpoints included)
arr6 = np.random.randn(3, 3)  # 3x3 array of samples from the standard normal distribution

print("Array 1:", arr1)
print("\nArray 2 (zeros):\n", arr2)
print("\nArray 4 (arange):", arr4)
print("\nArray 5 (linspace):", arr5)
print("\nArray 6 (random):\n", arr6)

Array 1: [1 2 3 4 5]

Array 2 (zeros):
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]

Array 4 (arange): [0 2 4 6 8]

Array 5 (linspace): [0.   0.25 0.5  0.75 1.  ]

Array 6 (random):
 [[-0.91289895 -1.18367456  0.39318399]
 [-1.19850453  0.21915037 -0.57954566]
 [ 1.35979219 -0.18941738 -0.35833412]]


### 1.2 Array Properties and Attributes

In [3]:
# Array properties
arr = np.array([[1, 2, 3], [4, 5, 6]])

print("Array:\n", arr)
print("\nShape:", arr.shape)  # (rows, columns)
print("Dimensions:", arr.ndim)
print("Size (total elements):", arr.size)
print("Data type:", arr.dtype)
print("Item size (bytes):", arr.itemsize)

Array:
 [[1 2 3]
 [4 5 6]]

Shape: (2, 3)
Dimensions: 2
Size (total elements): 6
Data type: int64
Item size (bytes): 8
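
The `dtype` and `itemsize` attributes above tie directly into the memory-efficiency focus point. As a minimal sketch (the array names `big_int64` and `small_int32` are illustrative, not part of the notebook), the same values can be stored in half the memory when a smaller dtype is known to be safe:

```python
import numpy as np

# One million integers stored as 64-bit values
big_int64 = np.arange(1_000_000, dtype=np.int64)

# Downcast to 32 bits; safe here because every value fits in an int32
small_int32 = big_int64.astype(np.int32)

# nbytes is size * itemsize, so halving itemsize halves the memory
print("int64 bytes:", big_int64.nbytes)
print("int32 bytes:", small_int32.nbytes)
print("Same values:", np.array_equal(big_int64, small_int32))
```

Downcasting is only appropriate when the value range is known in advance; values outside the target range would overflow.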


### 1.3 Array Indexing and Slicing

In [4]:
# Indexing and slicing
arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8],
                [9, 10, 11, 12]])

print("Original array:\n", arr)
print("\nElement at [1, 2]:", arr[1, 2])
print("\nFirst row:", arr[0, :])
print("\nSecond column:", arr[:, 1])
print("\nSubarray:\n", arr[0:2, 1:3])

# Boolean indexing
print("\nElements > 5:", arr[arr > 5])

# Fancy indexing
print("\nRows 0 and 2:", arr[[0, 2], :])

Original array:
 [[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Element at [1, 2]: 7

First row: [1 2 3 4]

Second column: [ 2  6 10]

Subarray:
 [[2 3]
 [6 7]]

Elements > 5: [ 6  7  8  9 10 11 12]

Rows 0 and 2: [[ 1  2  3  4]
 [ 9 10 11 12]]
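
One detail the indexing examples above do not show is that basic slicing returns a view onto the original data, while boolean and fancy indexing return copies. The sketch below (variable names `view` and `picked` are illustrative) shows the practical consequence:

```python
import numpy as np

arr = np.arange(1, 13).reshape(3, 4)

# Basic slicing returns a view: writing through it changes the original
view = arr[0:2, 1:3]
view[0, 0] = 100
print("arr[0, 1] after writing to the slice:", arr[0, 1])  # 100

# Fancy indexing returns a copy: the original is left untouched
picked = arr[[0, 2], :]
picked[0, 0] = -1
print("arr[0, 0] after writing to the fancy-indexed result:", arr[0, 0])  # 1

print("Slice shares memory:", np.shares_memory(arr, view))
print("Fancy result shares memory:", np.shares_memory(arr, picked))
```

Call `.copy()` on a slice when an independent array is needed.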


### 1.4 Array Operations and Broadcasting

In [5]:
# Element-wise operations
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print("Array a:", a)
print("Array b:", b)
print("\nAddition:", a + b)
print("Multiplication:", a * b)
print("Power:", a ** 2)

# Broadcasting example
matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
vector = np.array([1, 0, 1])

print("\nMatrix:\n", matrix)
print("\nVector:", vector)
print("\nMatrix + Vector (broadcasting):\n", matrix + vector)

Array a: [1 2 3 4]
Array b: [10 20 30 40]

Addition: [11 22 33 44]
Multiplication: [ 10  40  90 160]
Power: [ 1  4  9 16]

Matrix:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

Vector: [1 0 1]

Matrix + Vector (broadcasting):
 [[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]]
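
To connect the broadcasting example above with the "vectorization vs loops" focus point, here is a small sketch that computes the same result with explicit Python loops and with a single vectorized expression. It also adds a column-shaped vector to show broadcasting along the other axis. The names `looped`, `broadcast`, and `column` are illustrative.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
vector = np.array([1, 0, 1])

# Loop version: add the vector to every row, element by element
looped = np.empty_like(matrix)
for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        looped[i, j] = matrix[i, j] + vector[j]

# Vectorized version: shape (3,) is broadcast across the rows of shape (3, 3)
broadcast = matrix + vector
print("Loop and broadcast agree:", np.array_equal(looped, broadcast))

# A (3, 1) column is broadcast across the columns instead
column = vector.reshape(3, 1)
print("Matrix + column vector:\n", matrix + column)
```

The vectorized form pushes the loop into compiled code, which is where the speed advantage on large arrays comes from.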


### 1.5 Statistical Operations

In [6]:
# Statistical operations
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

print("Array:\n", arr)
print("\nSum:", arr.sum())
print("Mean:", arr.mean())
print("Standard deviation:", arr.std())
print("Min:", arr.min())
print("Max:", arr.max())
print("\nSum along axis 0 (columns):", arr.sum(axis=0))
print("Sum along axis 1 (rows):", arr.sum(axis=1))

Array:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]

Sum: 45
Mean: 5.0
Standard deviation: 2.581988897471611
Min: 1
Max: 9

Sum along axis 0 (columns): [12 15 18]
Sum along axis 1 (rows): [ 6 15 24]
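
The standard deviation printed above is the population value, because NumPy's `std()` uses `ddof=0` by default, whereas pandas uses `ddof=1` (the sample standard deviation). The following sketch, with illustrative variable names, shows how to get matching results:

```python
import numpy as np
import pandas as pd

values = np.arange(1, 10)  # the same numbers 1..9 used in the array above

pop_std = values.std()              # NumPy default: ddof=0 (population)
sample_std = values.std(ddof=1)     # sample standard deviation
pandas_std = pd.Series(values).std()  # pandas default: ddof=1

print("NumPy std (ddof=0):", pop_std)
print("NumPy std (ddof=1):", sample_std)
print("pandas std (default):", pandas_std)
```

This difference explains why a NumPy result and a `describe()` result for the same column can disagree slightly.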


### 1.6 Reshaping and Transposing

In [7]:
# Reshaping
arr = np.arange(12)
print("Original:", arr)
print("\nReshaped (3x4):\n", arr.reshape(3, 4))
print("\nReshaped (2x6):\n", arr.reshape(2, 6))

# Transpose
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
print("\nOriginal matrix:\n", matrix)
print("\nTransposed:\n", matrix.T)

# Flatten
print("\nFlattened:", matrix.flatten())

Original: [ 0  1  2  3  4  5  6  7  8  9 10 11]

Reshaped (3x4):
 [[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]

Reshaped (2x6):
 [[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]]

Original matrix:
 [[1 2 3]
 [4 5 6]]

Transposed:
 [[1 4]
 [2 5]
 [3 6]]

Flattened: [1 2 3 4 5 6]
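
Two related conveniences are worth a brief sketch: passing `-1` to `reshape()` lets NumPy infer one dimension, and `ravel()` returns a view where possible while `flatten()` always copies. The variable names here are illustrative.

```python
import numpy as np

# -1 asks NumPy to infer the missing dimension from the total size
inferred = np.arange(12).reshape(3, -1)
print("Inferred shape:", inferred.shape)  # (3, 4)

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

flat_copy = matrix.flatten()  # always a new array
flat_view = matrix.ravel()    # a view here, since matrix is contiguous

print("flatten shares memory:", np.shares_memory(matrix, flat_copy))
print("ravel shares memory:", np.shares_memory(matrix, flat_view))
```

For large arrays, preferring `ravel()` avoids an unnecessary copy when the result is only read.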


## 2. Pandas Basics

### 2.1 Creating Series and DataFrames

In [8]:
# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series:")
print(s)

# Creating a DataFrame from dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Berlin'],
    'Salary': [50000, 60000, 55000, 58000, 62000]
}

df = pd.DataFrame(data)
print("\nDataFrame:")
print(df)

Series:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

DataFrame:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   55000
3    David   28     Tokyo   58000
4      Eve   32    Berlin   62000


### 2.2 DataFrame Operations

In [9]:
# Basic information
print("DataFrame Info:")
df.info()  # info() prints its report directly and returns None

print("\nFirst 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

print("\nDescriptive statistics:")
print(df.describe())

print("\nColumn names:", df.columns.tolist())
print("Shape:", df.shape)

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   City    5 non-null      object
 3   Salary  5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 292.0+ bytes

First 3 rows:
      Name  Age      City  Salary
0    Alice   25  New York   50000
1      Bob   30    London   60000
2  Charlie   35     Paris   55000

Last 2 rows:
    Name  Age    City  Salary
3  David   28   Tokyo   58000
4    Eve   32  Berlin   62000

Descriptive statistics:
             Age       Salary
count   5.000000      5.00000
mean   30.000000  57000.00000
std     3.807887   4690.41576
min    25.000000  50000.00000
25%    28.000000  55000.00000
50%    30.000000  58000.00000
75%    32.000000  60000.00000
max    35.000000  62000.00000

Column names: ['Name', 'Age', 'City', 'Salary']
Shape: (5, 4)

### 2.3 Indexing and Selection

In [10]:
# Selecting columns
print("Age column:")
print(df['Age'])

print("\nMultiple columns:")
print(df[['Name', 'Salary']])

# Selecting rows by position (iloc)
print("\nFirst row (iloc):")
print(df.iloc[0])

print("\nFirst 3 rows, first 2 columns (iloc):")
print(df.iloc[0:3, 0:2])

# Selecting rows by label (loc)
print("\nRows 0-2 (loc):")
print(df.loc[0:2, ['Name', 'Age']])

# Boolean indexing
print("\nPeople older than 30:")
print(df[df['Age'] > 30])

Age column:
0    25
1    30
2    35
3    28
4    32
Name: Age, dtype: int64

Multiple columns:
      Name  Salary
0    Alice   50000
1      Bob   60000
2  Charlie   55000
3    David   58000
4      Eve   62000

First row (iloc):
Name         Alice
Age             25
City      New York
Salary       50000
Name: 0, dtype: object

First 3 rows, first 2 columns (iloc):
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

Rows 0-2 (loc):
      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

People older than 30:
      Name  Age    City  Salary
2  Charlie   35   Paris   55000
4      Eve   32  Berlin   62000
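
With the default integer index, `loc` and `iloc` can look interchangeable. A small sketch with a string index (the `people` DataFrame is a made-up example) makes the difference clearer: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position.

```python
import pandas as pd

people = pd.DataFrame(
    {"Age": [25, 30, 35, 28]},
    index=["alice", "bob", "charlie", "david"],
)

# Label-based slice: both endpoints are included
by_label = people.loc["alice":"charlie"]
print("loc rows:", by_label.index.tolist())

# Position-based slice: the end position is excluded, as in Python lists
by_position = people.iloc[0:2]
print("iloc rows:", by_position.index.tolist())
```

This is also why `df.loc[0:2]` earlier returned three rows while `df.iloc[0:3]` was needed to get the same three rows by position.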


### 2.4 Data Manipulation

In [11]:
# Adding new columns
df['Salary_Bonus'] = df['Salary'] * 1.1  # Salary including a 10% bonus
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')

print("DataFrame with new columns:")
print(df)

# Dropping columns
df_copy = df.copy()
df_copy = df_copy.drop('Salary_Bonus', axis=1)
print("\nAfter dropping Salary_Bonus:")
print(df_copy.columns.tolist())

# Sorting
print("\nSorted by Age (descending):")
print(df.sort_values('Age', ascending=False))

DataFrame with new columns:
      Name  Age      City  Salary  Salary_Bonus Age_Group
0    Alice   25  New York   50000       55000.0     Young
1      Bob   30    London   60000       66000.0    Senior
2  Charlie   35     Paris   55000       60500.0    Senior
3    David   28     Tokyo   58000       63800.0     Young
4      Eve   32    Berlin   62000       68200.0    Senior

After dropping Salary_Bonus:
['Name', 'Age', 'City', 'Salary', 'Age_Group']

Sorted by Age (descending):
      Name  Age      City  Salary  Salary_Bonus Age_Group
2  Charlie   35     Paris   55000       60500.0    Senior
4      Eve   32    Berlin   62000       68200.0    Senior
1      Bob   30    London   60000       66000.0    Senior
3    David   28     Tokyo   58000       63800.0     Young
0    Alice   25  New York   50000       55000.0     Young
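
The `Age_Group` column above is built with `apply()` and a lambda, which calls a Python function once per row. As a hedged alternative in the spirit of the vectorization focus point, `np.where` can produce the same labels in one vectorized step. The Series below re-creates the ages from the example, and the names `with_apply` and `with_where` are illustrative.

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 28, 32], name="Age")

# Row-by-row version, as used in the cell above
with_apply = ages.apply(lambda x: "Young" if x < 30 else "Senior")

# Vectorized version: the condition is evaluated on the whole Series at once
with_where = pd.Series(np.where(ages < 30, "Young", "Senior"), index=ages.index)

print(with_where.tolist())
print("Same labels:", with_apply.tolist() == with_where.tolist())
```

On five rows the difference is negligible; the vectorized form tends to matter on large DataFrames.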


### 2.5 Handling Missing Data

In [12]:
# Creating DataFrame with missing values
data_with_nan = {
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12]
}
df_nan = pd.DataFrame(data_with_nan)

print("DataFrame with NaN:")
print(df_nan)

print("\nCheck for NaN:")
print(df_nan.isnull())

print("\nCount of NaN per column:")
print(df_nan.isnull().sum())

# Filling missing values
print("\nFill NaN with 0:")
print(df_nan.fillna(0))

print("\nFill NaN with column mean:")
print(df_nan.fillna(df_nan.mean()))

# Dropping missing values
print("\nDrop rows with any NaN:")
print(df_nan.dropna())

print("\nDrop columns with any NaN:")
print(df_nan.dropna(axis=1))

DataFrame with NaN:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0  12

Check for NaN:
       A      B      C
0  False  False  False
1  False   True  False
2   True   True  False
3  False  False  False

Count of NaN per column:
A    1
B    2
C    0
dtype: int64

Fill NaN with 0:
     A    B   C
0  1.0  5.0   9
1  2.0  0.0  10
2  0.0  0.0  11
3  4.0  8.0  12

Fill NaN with column mean:
          A    B   C
0  1.000000  5.0   9
1  2.000000  6.5  10
2  2.333333  6.5  11
3  4.000000  8.0  12

Drop rows with any NaN:
     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12

Drop columns with any NaN:
    C
0   9
1  10
2  11
3  12
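
Besides `fillna()` and `dropna()`, pandas also provides `interpolate()`, which estimates missing values from their neighbours. A minimal sketch on a made-up Series, using the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation (the default) fills the gap in equal steps
filled = s.interpolate()
print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```

Interpolation assumes the values are ordered meaningfully (for example, a time series); for unordered data, a mean or median fill is usually the more defensible choice.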


### 2.6 GroupBy and Aggregation

In [13]:
# Creating sample data
data = {
    'Department': ['IT', 'HR', 'IT', 'HR', 'IT', 'Finance', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Salary': [70000, 50000, 75000, 52000, 80000, 60000, 65000],
    'Years': [3, 5, 4, 6, 7, 2, 3]
}
df_emp = pd.DataFrame(data)

print("Employee Data:")
print(df_emp)

# GroupBy operations
print("\nAverage salary by department:")
print(df_emp.groupby('Department')['Salary'].mean())

print("\nMultiple aggregations:")
print(df_emp.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],
    'Years': 'mean'
}))

print("\nCount by department:")
print(df_emp.groupby('Department').size())

Employee Data:
  Department Employee  Salary  Years
0         IT    Alice   70000      3
1         HR      Bob   50000      5
2         IT  Charlie   75000      4
3         HR    David   52000      6
4         IT      Eve   80000      7
5    Finance    Frank   60000      2
6    Finance    Grace   65000      3

Average salary by department:
Department
Finance    62500.0
HR         51000.0
IT         75000.0
Name: Salary, dtype: float64

Multiple aggregations:
             Salary                   Years
               mean    min    max      mean
Department                                 
Finance     62500.0  60000  65000  2.500000
HR          51000.0  50000  52000  5.500000
IT          75000.0  70000  80000  4.666667

Count by department:
Department
Finance    2
HR         2
IT         3
dtype: int64
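
The dictionary form of `agg()` above produces a two-level column header. One possible alternative is named aggregation, which yields flat, self-describing column names. The output names `avg_salary`, `max_salary`, and `headcount` below are chosen for illustration.

```python
import pandas as pd

df_emp = pd.DataFrame({
    "Department": ["IT", "HR", "IT", "HR", "IT", "Finance", "Finance"],
    "Employee": ["Alice", "Bob", "Charlie", "David", "Eve", "Frank", "Grace"],
    "Salary": [70000, 50000, 75000, 52000, 80000, 60000, 65000],
    "Years": [3, 5, 4, 6, 7, 2, 3],
})

# Each keyword is an output column: name=(input column, aggregation)
summary = df_emp.groupby("Department").agg(
    avg_salary=("Salary", "mean"),
    max_salary=("Salary", "max"),
    headcount=("Employee", "count"),
)
print(summary)
```

Flat column names are often easier to work with when the result is merged or exported later.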


### 2.7 Merging and Joining DataFrames

In [14]:
# Creating sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'Score': [85, 90, 78, 88]
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Inner join (default)
print("\nInner join:")
print(pd.merge(df1, df2, on='ID', how='inner'))

# Left join
print("\nLeft join:")
print(pd.merge(df1, df2, on='ID', how='left'))

# Outer join
print("\nOuter join:")
print(pd.merge(df1, df2, on='ID', how='outer'))

# Concatenation
df3 = pd.DataFrame({'ID': [7], 'Name': ['Eve']})
print("\nConcatenate:")
print(pd.concat([df1, df3], ignore_index=True))

DataFrame 1:
   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    David

DataFrame 2:
   ID  Score
0   3     85
1   4     90
2   5     78
3   6     88

Inner join:
   ID     Name  Score
0   3  Charlie     85
1   4    David     90

Left join:
   ID     Name  Score
0   1    Alice    NaN
1   2      Bob    NaN
2   3  Charlie   85.0
3   4    David   90.0

Outer join:
   ID     Name  Score
0   1    Alice    NaN
1   2      Bob    NaN
2   3  Charlie   85.0
3   4    David   90.0
4   5      NaN   78.0
5   6      NaN   88.0

Concatenate:
   ID     Name
0   1    Alice
1   2      Bob
2   3  Charlie
3   4    David
4   7      Eve
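
When checking which rows matched in a join, the `indicator=True` option of `pd.merge` can help: it adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`. A short sketch using the same two DataFrames as above:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "Name": ["Alice", "Bob", "Charlie", "David"]})
df2 = pd.DataFrame({"ID": [3, 4, 5, 6],
                    "Score": [85, 90, 78, 88]})

# Outer join with a _merge column describing where each row came from
merged = pd.merge(df1, df2, on="ID", how="outer", indicator=True)
print(merged)

# Count the rows in each category
print(merged["_merge"].value_counts())
```

Filtering on `_merge == "left_only"` is a common way to find records with no match in the other table.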


### 2.8 Reading and Writing Data

In [15]:
# Writing to CSV
df.to_csv('/tmp/sample_data.csv', index=False)
print("Data written to CSV")

# Reading from CSV
df_read = pd.read_csv('/tmp/sample_data.csv')
print("\nData read from CSV:")
print(df_read.head())

# Note: You can also use read_excel(), to_excel(), read_json(), to_json(), etc.

Data written to CSV

Data read from CSV:
      Name  Age      City  Salary  Salary_Bonus Age_Group
0    Alice   25  New York   50000       55000.0     Young
1      Bob   30    London   60000       66000.0    Senior
2  Charlie   35     Paris   55000       60500.0    Senior
3    David   28     Tokyo   58000       63800.0     Young
4      Eve   32    Berlin   62000       68200.0    Senior
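
The hard-coded `/tmp` path above assumes a Unix-like system. As a more portable sketch, the standard-library `tempfile` and `pathlib` modules can supply a temporary location on any platform. The small DataFrame here is a stand-in for `df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

df_small = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# The temporary directory is removed automatically when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "sample_data.csv"
    df_small.to_csv(csv_path, index=False)  # index=False avoids an extra column
    df_read = pd.read_csv(csv_path)

print(df_read)
```

Without `index=False`, the row index would be written as an unnamed first column and read back as `Unnamed: 0`.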


# NumPy & Pandas Fundamentals

- NumPy array creation, indexing, and slicing
- Broadcasting and vectorization, which often run 10-100x faster than equivalent Python loops
- Pandas DataFrames and Series operations
- loc vs iloc indexing (a common interview question)
- Handling missing data with dropna(), fillna(), and interpolate()
- GroupBy operations and data aggregation
- Merging DataFrames with different join types


https://youtu.be/OMwIK4Zg9A8