# NumPy & Pandas Crash Course (ML-Relevant Topics)
This notebook covers the essential NumPy and Pandas concepts needed for Machine Learning data preparation.

## Part 1 – NumPy Basics
NumPy is a library for numerical computing, widely used for handling arrays and matrices in ML.

In [None]:

import numpy as np

# Creating arrays
arr1 = np.array([1, 2, 3])               # From Python list
arr2 = np.zeros((2, 3))                  # 2x3 array of zeros
arr3 = np.ones((3, 3))                   # 3x3 array of ones
arr4 = np.arange(0, 11, 2)               # 0 to 10 in steps of 2 (the stop value 11 is excluded)
arr5 = np.linspace(0, 1, 5)              # 5 evenly spaced numbers from 0 to 1, both endpoints included

print(arr1)
print(arr2)
print(arr3)
print(arr4)
print(arr5)

[1 2 3]
[[0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[ 0  2  4  6  8 10]
[0.   0.25 0.5  0.75 1.  ]


In [None]:
# Array properties
arr = np.array([[[1,1],[2,2],[3,3]]])
print("Shape:", arr.shape)      # Length of each axis; this array is 3-D, so three numbers
print("Dimensions:", arr.ndim)  # Number of dimensions
print("Size:", arr.size)        # Total elements

Shape: (1, 3, 2)
Dimensions: 3
Size: 6


In [None]:
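# Indexing & slicing (arr is 3-D here, with shape (1, 3, 2))
#print(arr)
print(arr[0,2,0])   # Block 0, row 2, column 0
#print(arr[:, 1])   # Row 1 of every block
#print(arr[0, :])   # Block 0, all rows and columns (index 1 would be out of bounds on axis 0)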

3
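The `[row, col]` pattern is easier to see on a 2-D array. The following is a minimal sketch on a hypothetical 3x4 `grid` (a name introduced only for this illustration, not part of the notebook's data):

```python
import numpy as np

# Hypothetical 3x4 array used only for illustration:
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]
grid = np.arange(12).reshape(3, 4)

element = grid[0, 1]     # Row 0, column 1
column = grid[:, 1]      # All rows, column 1
row = grid[1, :]         # Row 1, all columns
block = grid[:2, 1:3]    # Rows 0-1, columns 1-2 (stop indices are excluded)

print(element)
print(column)
print(row)
print(block)
```

Each comma-separated position in the brackets selects along one axis, so a 3-D array simply takes a third index or slice.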


In [None]:
# Mathematical operations
a = np.array([[1, 2, 3],[1,2,3]])
b = np.array([[4, 5, 6],[4, 5, 6]])

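print(a + b)     # Element-wise addition
print(a - b)     # Element-wise subtraction
print(a * b)     # Element-wise multiplication
print(a / b)     # Element-wise division
print(np.dot(a, b.T))  # Matrix product: (2, 3) x (3, 2) -> (2, 2); np.dot(a, b) fails because the shapes do not align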

# Random numbers
print(np.random.rand(2, 5))
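The arithmetic operators work element by element, while a dot (matrix) product needs the inner dimensions to match. Two arrays of shape (2, 3) cannot be multiplied as matrices directly, but transposing the second one makes the shapes line up. A small sketch reusing the same values as `a` and `b`:

```python
import numpy as np

a = np.array([[1, 2, 3], [1, 2, 3]])   # shape (2, 3)
b = np.array([[4, 5, 6], [4, 5, 6]])   # shape (2, 3)

elementwise = a * b         # Same shape as the inputs: (2, 3)
matrix_product = a @ b.T    # (2, 3) @ (3, 2) -> (2, 2); same result as np.dot(a, b.T)

print(elementwise.shape, matrix_product.shape)
print(matrix_product)
```

Each entry of `matrix_product` is the sum of products of one row of `a` with one row of `b`, for example 1*4 + 2*5 + 3*6 = 32.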

In [None]:
# Reshaping arrays
arr = np.arange(16) # Create an array with 16 elements
print(arr.reshape(4, 4))  # Reshape to 4 rows and 4 columns
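`reshape` only works when the new shape holds exactly the same number of elements. Passing `-1` for one dimension asks NumPy to infer it from the others, which is handy when flattening data or adding a feature axis. A short sketch (the variable names are illustrative only):

```python
import numpy as np

arr = np.arange(16)               # 16 elements

two_rows = arr.reshape(2, -1)     # -1 is inferred as 8, giving shape (2, 8)
cube = arr.reshape(2, 2, 4)       # 2 * 2 * 4 = 16 elements, so this is valid
flat = cube.reshape(-1)           # Back to a 1-D array of 16 elements
column = arr.reshape(-1, 1)       # One column, 16 rows: shape (16, 1)

print(two_rows.shape, cube.shape, flat.shape, column.shape)
```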

## Part 2 – Pandas Basics
Pandas is a library for data analysis, used for working with tabular data (like spreadsheets).

In [None]:
import pandas as pd

# Creating DataFrame from NumPy arrays
names = np.array(["Usama", "Ahmad", "Salaar"])
ages = np.random.randint(20, 40, size=3)
salaries = np.linspace(30000, 60000, 3)

df = pd.DataFrame({
    "Name": names,
    "Age": ages,
    "Salary": salaries
})
df

,Name,Age,Salary
0,Usama,22,30000.0
1,Ahmad,21,45000.0
2,Salaar,23,60000.0
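Because the ages are drawn at random, the table changes on every run. If reproducible output is wanted, one option is a seeded generator from `np.random.default_rng`. This is a sketch only: the seed value 42 and the helper name `make_people` are arbitrary choices for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def make_people(seed):
    """Build the small example DataFrame with ages drawn from a seeded generator."""
    rng = np.random.default_rng(seed)
    names = np.array(["Usama", "Ahmad", "Salaar"])
    ages = rng.integers(20, 40, size=3)        # Integers from 20 to 39 (40 is excluded)
    salaries = np.linspace(30000, 60000, 3)
    return pd.DataFrame({"Name": names, "Age": ages, "Salary": salaries})

people = make_people(seed=42)
print(people)
```

Calling `make_people` again with the same seed returns the same ages, which makes results easier to compare between runs.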


In [None]:
# Reading a CSV file
# To upload a CSV from your computer in Colab:
# from google.colab import files
# uploaded = files.upload()
# Then replace "tested.csv" below with the uploaded file name

df = pd.read_csv("tested.csv")  # Replace with your actual file name
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [None]:
# Basic exploration
print(df.shape)      # Rows, columns
print(df.info())     # Data types and non-null counts
df.describe()        # Summary statistics

(418, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
None


,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,418.0,418.0,418.0,332.0,418.0,418.0,417.0
mean,1100.5,0.363636,2.26555,30.27259,0.447368,0.392344,35.627188
std,120.810458,0.481622,0.841838,14.181209,0.89676,0.981429,55.907576
min,892.0,0.0,1.0,0.17,0.0,0.0,0.0
25%,996.25,0.0,1.0,21.0,0.0,0.0,7.8958
50%,1100.5,0.0,3.0,27.0,0.0,0.0,14.4542
75%,1204.75,1.0,3.0,39.0,1.0,0.0,31.5
max,1309.0,1.0,3.0,76.0,8.0,9.0,512.3292


In [None]:
# Handling missing values
df.isnull().sum()                   # Count missing values per column (not displayed here; only a cell's last expression is echoed)
df['Age'] = df['Age'].fillna(df['Age'].mean())  # Fill missing ages with the mean and assign the result back
#df.dropna()                         # Drop rows with missing values
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          418 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
None


Note: the chained form `df['Age'].fillna(df['Age'].mean(), inplace=True)` triggers a pandas warning that the behavior will change in pandas 3.0, because the intermediate object `df['Age']` always behaves as a copy, so the in-place update never reaches `df`. Use `df[col] = df[col].fillna(value)` or `df.fillna({col: value}, inplace=True)` instead.
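After the `Age` fill, `Fare` still has one missing value (417 of 418 non-null). The sketch below shows the same assign-back pattern on a small made-up frame, using the median for `Fare` as one possible choice for a skewed column. The `toy` data is invented for illustration and is not taken from `tested.csv`.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 41.0],
    "Fare": [7.25, 8.05, np.nan, 512.33],
})

missing_before = toy.isnull().sum()                      # Missing values per column
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())        # Mean of the known ages
toy["Fare"] = toy["Fare"].fillna(toy["Fare"].median())   # Median is less sensitive to the large fare
missing_after = toy.isnull().sum()

print(missing_before)
print(missing_after)
print(toy)
```

Both `mean()` and `median()` skip missing values by default, so the fill value is computed from the known entries only.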


Dropping Columns

In [None]:
# Dropping Columns
# Drop multiple columns
df.drop(columns=['Cabin', 'Ticket','Name','Embarked','PassengerId','SibSp'], inplace=True)

print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  418 non-null    int64  
 1   Pclass    418 non-null    int64  
 2   Sex       418 non-null    object 
 3   Age       418 non-null    float64
 4   Parch     418 non-null    int64  
 5   Fare      417 non-null    float64
dtypes: float64(2), int64(3), object(1)
memory usage: 19.7+ KB
None


In [None]:
print(df.head(5))

   Survived  Pclass     Sex   Age  Parch     Fare
0         0       3    male  34.5      0   7.8292
1         1       3  female  47.0      0   7.0000
2         0       2    male  62.0      0   9.6875
3         0       3    male  27.0      0   8.6625
4         1       3  female  22.0      1  12.2875


In [None]:
X = df.iloc[:, 1:]   # Features: all rows, every column except the first (Survived)
y = df['Survived']   # Target: all rows, the Survived column (the first column)

In [None]:
X

,Pclass,Sex,Age,Parch,Fare
0,3,male,34.50000,0,7.8292
1,3,female,47.00000,0,7.0000
2,2,male,62.00000,0,9.6875
3,3,male,27.00000,0,8.6625
4,3,female,22.00000,1,12.2875
...,...,...,...,...,...
413,3,male,30.27259,0,8.0500
414,1,female,39.00000,0,108.9000
415,3,male,38.50000,0,7.2500
416,3,male,30.27259,0,8.0500


In [None]:
y

,Survived
0,0
1,1
2,0
3,0
4,1
...,...
413,0
414,1
415,0
416,0
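`X` still contains the text column `Sex`, and most ML estimators expect numeric input. One common option is to map the categories to integers before converting to NumPy arrays. The sketch below uses a small made-up frame with the same column names; the 0/1 coding is an arbitrary choice for this example, not something the notebook prescribes.

```python
import pandas as pd

# Made-up rows with the same column names as the notebook's DataFrame
toy = pd.DataFrame({
    "Survived": [0, 1, 0, 1],
    "Pclass": [3, 3, 2, 1],
    "Sex": ["male", "female", "male", "female"],
    "Fare": [7.83, 7.0, 9.69, 108.9],
})

X = toy.drop(columns=["Survived"])    # Features: everything except the target
y = toy["Survived"]                   # Target column

X["Sex"] = X["Sex"].map({"male": 0, "female": 1})   # Encode the text categories as integers

X_values = X.to_numpy()   # 2-D array of shape (n_samples, n_features)
y_values = y.to_numpy()   # 1-D array of shape (n_samples,)

print(X_values.shape, y_values.shape)
```

Selecting the features by name with `drop(columns=...)` gives the same split as `iloc[:, 1:]` here, but it does not depend on the target being the first column.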
