# Lab 1: Introduction to Python Libraries for Machine Learning

## NumPy Exercises

In [1]:
import numpy as np

### 1. Basic Array Creation & Manipulation

Create a 1D array of numbers from 1 to 20.

In [2]:
a = np.arange(1,21)
print(a)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]


In [3]:
a = np.array(list(range(1, 21)))
print(a)

[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20]


Create a 3×4 matrix of ones and reshape it to 4×3.

In [4]:
matrix = np.ones((3,4))
print(matrix)
print(matrix.shape)

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
(3, 4)


In [5]:
matrix = matrix.reshape((4,3))
print(matrix)
print(matrix.shape)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
(4, 3)
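With a matrix of ones, the effect of reshaping on element order is invisible. The sketch below uses a hypothetical array of distinct values (not part of the lab) to show that reshaping keeps the row-major order of elements, which is not the same as transposing.

```python
import numpy as np

# Hypothetical demo array: the values 0..11 make the element order visible.
grid = np.arange(12).reshape(3, 4)

reshaped = grid.reshape(4, 3)   # same row-major element order, new shape
transposed = grid.T             # rows and columns swapped

print(reshaped)
print(transposed)
print(np.array_equal(reshaped, transposed))  # the two results differ
```

Both results have shape (4, 3), but only the transpose turns the original columns into rows.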


Create a 5×5 identity matrix.

In [6]:
identity = np.identity(5)
print(identity)
print(identity.shape)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
(5, 5)


Generate 15 equally spaced numbers between 5 and 50.

In [7]:
linespace = np.linspace(5,50,15)
print(linespace)
print(linespace.size)

[ 5.          8.21428571 11.42857143 14.64285714 17.85714286 21.07142857
 24.28571429 27.5        30.71428571 33.92857143 37.14285714 40.35714286
 43.57142857 46.78571429 50.        ]
15
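The spacing in the output above follows from dividing the interval into 14 equal gaps, since 15 points include both endpoints. As a small check, np.linspace can return that step directly through its retstep argument:

```python
import numpy as np

# retstep=True returns the spacing together with the samples.
values, step = np.linspace(5, 50, 15, retstep=True)

print(step)                                   # (50 - 5) / 14
print(np.allclose(np.diff(values), step))     # every gap has the same size
```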


Generate a 4×4 matrix of random integers between 1 and 100.

In [8]:
rand_m = np.random.randint(1,101, size=(4,4))
print(rand_m)
print(rand_m.shape)

[[ 17  26  37  77]
 [ 23  34  64  23]
 [ 58  65  75  57]
 [100  70   6  77]]
(4, 4)
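Because the matrix is random, the numbers above change on every run. One possible way to make such a cell reproducible is NumPy's Generator API with a fixed seed. The seed value 42 below is an arbitrary choice, and this sketch is an alternative to the np.random.randint call used in the lab, not a replacement required by it.

```python
import numpy as np

rng = np.random.default_rng(seed=42)              # arbitrary seed, chosen for illustration
rand_seeded = rng.integers(1, 101, size=(4, 4))   # upper bound is exclusive, so values are 1..100

# Re-creating the generator with the same seed reproduces the same matrix.
rand_again = np.random.default_rng(seed=42).integers(1, 101, size=(4, 4))
print(np.array_equal(rand_seeded, rand_again))
```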


### 2. Indexing, Slicing, and Broadcasting

Create a 3×3 matrix of random integers between 1 and 100.

In [9]:
m = np.random.randint(1,101,size=(3,3))
print(m)

[[72 67 96]
 [21 79  6]
 [79 35 49]]


Extract: First row, Second column, Center element.

In [10]:
print("First Row:", m[0])
print("Second Col:", m[:,1])

First Row: [72 67 96]
Second Col: [67 79 35]


In [11]:
def get_matrix_center(mat):
    rows, cols = mat.shape
    row_start = (rows - 1) // 2 if rows % 2 == 1 else rows // 2 - 1
    col_start = (cols - 1) // 2 if cols % 2 == 1 else cols // 2 - 1
    row_end = row_start + (1 if rows % 2 == 1 else 2) # row_start + row_span
    col_end = col_start + (1 if cols % 2 == 1 else 2) # col_start + col_span
    return mat[row_start:row_end, col_start:col_end]

In [12]:
print("Center(3x3):")
print(get_matrix_center(m))
print("Center(4x4):")
print(get_matrix_center(np.random.randint(1,101,size=(4,4))))

Center(3x3):
[[79]]
Center(4x4):
[[69 97]
 [59 29]]
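Since the 4×4 input above is random, its center cannot be checked by eye against a known matrix. The sketch below repeats the function so that it runs on its own and applies it to hypothetical arrays with predictable values, including a rectangular one where the row and column counts have different parity.

```python
import numpy as np

def get_matrix_center(mat):
    # Same logic as the cell above, repeated so this block is self-contained.
    rows, cols = mat.shape
    row_start = (rows - 1) // 2 if rows % 2 == 1 else rows // 2 - 1
    col_start = (cols - 1) // 2 if cols % 2 == 1 else cols // 2 - 1
    row_end = row_start + (1 if rows % 2 == 1 else 2)
    col_end = col_start + (1 if cols % 2 == 1 else 2)
    return mat[row_start:row_end, col_start:col_end]

odd = np.arange(1, 10).reshape(3, 3)   # values 1..9, single center element
even = np.arange(16).reshape(4, 4)     # values 0..15, 2x2 center block
rect = np.arange(12).reshape(3, 4)     # odd rows, even columns: 1x2 center

center_odd = get_matrix_center(odd)
center_even = get_matrix_center(even)
center_rect = get_matrix_center(rect)
print(center_odd, center_even, center_rect, sep="\n")
```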


Replace all values greater than 50 in a matrix with 999.

In [13]:
m[m > 50] = 999
print(m)

[[999 999 999]
 [ 21 999   6]
 [999  35  49]]
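The boolean-mask assignment above modifies m in place. If the original values are still needed, np.where is one way to build a new array instead. The sketch below re-creates the matrix printed earlier under the hypothetical name m_orig.

```python
import numpy as np

# The 3x3 matrix printed earlier, re-created so this block runs on its own.
m_orig = np.array([[72, 67, 96], [21, 79, 6], [79, 35, 49]])

# Where the condition holds take 999, otherwise keep the original value.
m_capped = np.where(m_orig > 50, 999, m_orig)

print(m_capped)
print(m_orig)   # unchanged
```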


Multiply a 1D array of size 5 by 10 using broadcasting.

In [14]:
r = np.random.randint(1,11, size=5)
print(r)
r = r * 10
print(r)

[ 5 10  1 10  2]
[ 50 100  10 100  20]
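Multiplying by a scalar is the simplest case of broadcasting: the scalar is treated as if it were stretched to the array's shape. The same rule applies between arrays of different shapes. The sketch below uses hypothetical arrays (table and offsets) to show a 1D array being broadcast across the rows of a 2D array.

```python
import numpy as np

r = np.array([5, 10, 1, 10, 2])        # the 1D array from the cell above
scaled = r * 10                         # scalar 10 is applied to all 5 elements

table = np.arange(6).reshape(2, 3)      # hypothetical 2x3 array: [[0 1 2], [3 4 5]]
offsets = np.array([100, 200, 300])     # shape (3,) is broadcast across both rows
shifted = table + offsets

print(scaled)
print(shifted)
```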


### 3. Mathematical and Statistical Operations

Create a 3×3 matrix of random integers between 1 and 100.

In [15]:
m = np.random.randint(1,101,size=(3,3))
print(m)

[[77 54  2]
 [29 84 49]
 [25 20 75]]


Compute sum, mean, median, std, var, min, and max of the above array.

In [16]:
print("Sum:", m.sum())
print("Mean:", m.mean())
print("Median:", np.median(m))
print("std:", m.std())
print("var:", m.var())
print("min:", m.min())
print("max:", m.max())

Sum: 415
Mean: 46.111111111111114
Median: 49.0
std: 27.204756301648775
var: 740.0987654320987
min: 2
max: 84
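The calls above reduce over the whole matrix. The same methods accept an axis argument for per-column or per-row results, and std and var accept ddof=1 for the sample (rather than population) estimate. A short sketch using the matrix printed above, stored under the hypothetical name m_stats:

```python
import numpy as np

# The 3x3 matrix printed above, re-created so this block runs on its own.
m_stats = np.array([[77, 54, 2], [29, 84, 49], [25, 20, 75]])

col_sums = m_stats.sum(axis=0)      # one value per column
row_means = m_stats.mean(axis=1)    # one value per row
pop_std = m_stats.std()             # population std (ddof=0), as in the cell above
sample_std = m_stats.std(ddof=1)    # sample std divides by n - 1 instead of n

print(col_sums, row_means, pop_std, sample_std, sep="\n")
```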


Normalize a 1D array of size 5 to scale values between 0 and 1. (Min-Max Normalization)

In [17]:
m = (m - m.min()) / (m.max() - m.min())
m

array([[0.91463415, 0.63414634, 0.        ],
       [0.32926829, 1.        , 0.57317073],
       [0.2804878 , 0.2195122 , 0.8902439 ]])
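The cell above applies the formula to the 3×3 matrix m, while the task asks for a 1D array of size 5. The formula is identical in that case. A minimal sketch with a hypothetical array arr:

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])   # hypothetical 1D array of size 5

# Min-max normalization: the minimum maps to 0 and the maximum maps to 1.
normalized = (arr - arr.min()) / (arr.max() - arr.min())
print(normalized)
```

Note that the formula divides by zero when all values are equal, so a constant array would need separate handling.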

### 4. NumPy Matrix Operations and Linear Algebra

Define matrices A = [[4, 2], [1, 3]] and B = [[2, 0], [1, 5]].

In [18]:
A = np.array([[4,2], [1,3]])
B = np.array([[2,0], [1,5]])
print(A)
print(B)

[[4 2]
 [1 3]]
[[2 0]
 [1 5]]


Find matrix multiplication of A and B.

In [19]:
matrix_mul = A @ B
print("@:\n",matrix_mul)
matrix_mul = np.matmul(A, B)
print("matmul:\n",matrix_mul)

@:
 [[10 10]
 [ 5 15]]
matmul:
 [[10 10]
 [ 5 15]]


Find dot product of A and B.

In [20]:
dot = np.dot(A, B)
print(dot)

[[10 10]
 [ 5 15]]
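For 2D inputs, np.dot gives the same result as the matrix product computed earlier. For 1D inputs it gives the inner product, a single number. The sketch below checks both and also shows that matrix multiplication is not commutative. The vectors u and v are hypothetical.

```python
import numpy as np

A = np.array([[4, 2], [1, 3]])
B = np.array([[2, 0], [1, 5]])

same_as_matmul = np.array_equal(np.dot(A, B), A @ B)   # 2D case: dot equals matmul
BA = B @ A                                             # order matters for matrices

u = np.array([1, 2, 3])   # hypothetical vectors
v = np.array([4, 5, 6])
inner = np.dot(u, v)      # 1*4 + 2*5 + 3*6

print(same_as_matmul, BA, inner, sep="\n")
```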


Find element-wise addition, subtraction, multiplication, division of A and B.

In [21]:
print("Addition")
print(A+B)

Addition
[[6 2]
 [2 8]]


In [22]:
print("Subtraction")
print(A-B)

Subtraction
[[ 2  2]
 [ 0 -2]]


In [23]:
print("Multiplication")
print(A*B)

Multiplication
[[ 8  0]
 [ 1 15]]


In [24]:
print("Division")
with np.errstate(divide='ignore', invalid='ignore'):
    print(A/B)

Division
[[2.  inf]
 [1.  0.6]]
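The inf in the output comes from dividing 2 by the zero in B. If a placeholder is preferable to inf, one option is np.divide with its where and out arguments, which skips the division wherever the denominator is zero. Using 0.0 as the placeholder is an assumption made here for illustration.

```python
import numpy as np

A = np.array([[4, 2], [1, 3]])
B = np.array([[2, 0], [1, 5]])

# Positions where B == 0 are not computed and keep the value already in `out` (0.0).
safe = np.divide(A, B, out=np.zeros(A.shape, dtype=float), where=(B != 0))
print(safe)
```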


Transpose matrix A.

In [25]:
print(A.T)

[[4 1]
 [2 3]]


Compute determinant of A.

In [26]:
print(np.linalg.det(A))

10.000000000000002


Compute inverse of A.

In [27]:
print(np.linalg.inv(A))

[[ 0.3 -0.2]
 [-0.1  0.4]]
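The determinant printed earlier is 10.000000000000002 rather than exactly 10 because of floating-point rounding. A quick sanity check, sketched below, compares it with the 2×2 formula ad - bc and confirms that A times its inverse is the identity up to rounding.

```python
import numpy as np

A = np.array([[4, 2], [1, 3]])

det_manual = 4 * 3 - 2 * 1            # ad - bc for a 2x2 matrix
A_inv = np.linalg.inv(A)

print(np.isclose(np.linalg.det(A), det_manual))
print(np.allclose(A @ A_inv, np.eye(2)))   # A times its inverse is the identity
```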


Find eigenvalues and eigenvectors of A.

In [28]:
eig_values, eig_vectors = np.linalg.eig(A)
print("Eigen Values:", eig_values)
print("Eigen Vectors:", eig_vectors)

Eigen Values: [5. 2.]
Eigen Vectors: [[ 0.89442719 -0.70710678]
 [ 0.4472136   0.70710678]]
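np.linalg.eig returns the eigenvectors as the columns of the second result, so column i pairs with eigenvalue i. The defining relation A v = λ v can be checked for all pairs at once, as in this sketch:

```python
import numpy as np

A = np.array([[4, 2], [1, 3]])
eig_values, eig_vectors = np.linalg.eig(A)

# A @ V multiplies every column (eigenvector) by A;
# V * eig_values scales column i by eigenvalue i through broadcasting.
relation_holds = np.allclose(A @ eig_vectors, eig_vectors * eig_values)
print(relation_holds)
```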


Solve the system of equations: 2x + y = 8, 3x + 4y = 18.

In [29]:
coeffs = np.array([[2,1], [3,4]])
consts = np.array([8,18])
x, y = np.linalg.solve(coeffs, consts)
print("X:", x)
print("Y:", y)

X: 2.7999999999999994
Y: 2.4000000000000004
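The printed values are 2.8 and 2.4 up to floating-point rounding. Substituting them back into the equations is a simple way to confirm the solution, sketched here with np.allclose to tolerate the rounding:

```python
import numpy as np

coeffs = np.array([[2, 1], [3, 4]])
consts = np.array([8, 18])
solution = np.linalg.solve(coeffs, consts)

# Substitute back: coeffs @ solution should reproduce the right-hand side.
print(np.allclose(coeffs @ solution, consts))
print(np.allclose(solution, [2.8, 2.4]))
```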


## Pandas Exercises

In [30]:
import pandas as pd

### 1. Series & DataFrame Basics

Create a Pandas Series with student marks and names as index.

Given the following list of marks:  
[78, 85, 92, 70, 66]  
Create a Pandas Series and assign the following student names as indices:  
['Amit', 'Bhavna', 'Chetan', 'Divya', 'Esha']  
Display the Series.  


In [31]:
students = ['Amit', 'Bhavna', 'Chetan', 'Divya', 'Esha']
marks = [78, 85, 92, 70, 66]
S = pd.Series(marks, index=students)
print(f"Marks:\n{S}\n")

Marks:
Amit      78
Bhavna    85
Chetan    92
Divya     70
Esha      66
dtype: int64
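The point of using names as the index is that values can then be looked up and filtered by label. A short sketch on the same Series:

```python
import pandas as pd

students = ['Amit', 'Bhavna', 'Chetan', 'Divya', 'Esha']
marks = [78, 85, 92, 70, 66]
S = pd.Series(marks, index=students)

chetan_mark = S["Chetan"]   # label-based lookup
above_80 = S[S > 80]        # boolean filtering keeps the labels
top_student = S.idxmax()    # label of the highest mark

print(chetan_mark, above_80, top_student, sep="\n")
```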



Create a DataFrame using the provided dictionary.  
  
data = {  
    'Name': ['Amit', 'Bhavna', 'Chetan', 'Divya', 'Esha'],  
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Female'],  
    'Math': [78, 85, 92, 70, 66],  
    'Science': [88, 79, 95, 72, 60]  
}  

In [32]:
data = {
    'Name': ['Amit', 'Bhavna', 'Chetan', 'Divya', 'Esha'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Female'],
    'Math': [78, 85, 92, 70, 66],
    'Science': [88, 79, 95, 72, 60]
}
data = pd.DataFrame(data)

Display full DataFrame, column names, and shape.

In [33]:
print(data)

     Name  Gender  Math  Science
0    Amit    Male    78       88
1  Bhavna  Female    85       79
2  Chetan    Male    92       95
3   Divya  Female    70       72
4    Esha  Female    66       60


In [34]:
print("Columns:",data.columns)

Columns: Index(['Name', 'Gender', 'Math', 'Science'], dtype='object')


In [35]:
print("Shape:",data.shape)

Shape: (5, 4)


### 2. Data Exploration

**Load dataset from URL: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data**

In [36]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Assign proper column names.

["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style", "drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type","num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower", "peak-rpm","city-mpg","highway-mpg","price"]

In [37]:
col_names = [
    "symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
    "drive-wheels","engine-location","wheel-base","length","width","height","curb-weight","engine-type",
    "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower", 
    "peak-rpm","city-mpg","highway-mpg","price"
]
df.columns = col_names
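As an alternative to assigning df.columns after loading, pd.read_csv can take the column names, and the marker used for missing values, at load time. The sketch below avoids the network by reading a made-up three-column sample in the same style as the dataset (no header row, '?' for missing). The sample rows and the name sample are hypothetical.

```python
import io
import pandas as pd

# Made-up sample, not the real file: no header row, '?' marks a missing value.
raw = "alfa-romero,two,13495\naudi,four,?\nvolvo,?,16845\n"

sample = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["make", "num-of-doors", "price"],  # column names assigned while loading
    na_values="?",                            # '?' is parsed as NaN
)
print(sample)
print(sample.dtypes)
```

With na_values set, the price column is parsed as numeric straight away instead of as strings.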

In [38]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [39]:
df.tail()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
200,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,23,28,16845
201,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,8.7,160,5300,19,25,19045
202,-1,95,volvo,gas,std,four,sedan,rwd,front,109.1,...,173,mpfi,3.58,2.87,8.8,134,5500,18,23,21485
203,-1,95,volvo,diesel,turbo,four,sedan,rwd,front,109.1,...,145,idi,3.01,3.4,23.0,106,4800,26,27,22470
204,-1,95,volvo,gas,turbo,four,sedan,rwd,front,109.1,...,141,mpfi,3.78,3.15,9.5,114,5400,19,25,22625


Display .shape, .columns, .info(), and .describe().

In [40]:
print("Shape:", df.shape)

Shape: (205, 26)


In [41]:
print("Columns:", df.columns)

Columns: Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price'],
      dtype='object')


In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  205 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

In [43]:
df.describe()

Unnamed: 0,symboling,wheel-base,length,width,height,curb-weight,engine-size,compression-ratio,city-mpg,highway-mpg
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,10.142537,25.219512,30.75122
std,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,3.97204,6.542142,6.886443
min,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,7.0,13.0,16.0
25%,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,8.6,19.0,25.0
50%,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,9.0,24.0,30.0
75%,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,9.4,30.0,34.0
max,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,23.0,49.0,54.0


In [44]:
df.describe(include=object)

Unnamed: 0,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,engine-type,num-of-cylinders,fuel-system,bore,stroke,horsepower,peak-rpm,price
count,205,205,205,205,205,205,205,205,205,205,205,205.0,205.0,205,205,205
unique,52,22,2,2,3,5,3,2,7,7,8,39.0,37.0,60,24,187
top,?,toyota,gas,std,four,sedan,fwd,front,ohc,four,mpfi,3.62,3.4,68,5500,?
freq,41,32,185,168,114,96,120,202,148,159,94,23.0,20.0,19,37,4


Display width, height, curb-weight, engine-type columns.

In [45]:
df4 = df[["width","height","curb-weight","engine-type"]]
df4.head()

Unnamed: 0,width,height,curb-weight,engine-type
0,64.1,48.8,2548,dohc
1,64.1,48.8,2548,dohc
2,65.5,52.4,2823,ohcv
3,66.2,54.3,2337,ohc
4,66.4,54.3,2824,ohc


In [46]:
df4loc = df.loc[:, ["width","height","curb-weight","engine-type"]]
df4loc.head()

Unnamed: 0,width,height,curb-weight,engine-type
0,64.1,48.8,2548,dohc
1,64.1,48.8,2548,dohc
2,65.5,52.4,2823,ohcv
3,66.2,54.3,2337,ohc
4,66.4,54.3,2824,ohc


Display car details with num-of-doors = four.

In [47]:
doors_4 = df[df["num-of-doors"] == "four"]
doors_4.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,...,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,...,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875


In [48]:
doors_4.shape

(114, 26)
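The same boolean-mask pattern extends to several conditions, combined with & (and) or | (or), with each condition in parentheses. A sketch on a small hypothetical DataFrame named cars, not the real dataset:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
cars = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "volvo"],
    "num-of-doors": ["four", "two", "four", "four"],
    "body-style": ["sedan", "sedan", "sedan", "wagon"],
})

# Parentheses are required because & binds more tightly than ==.
four_door_sedans = cars[(cars["num-of-doors"] == "four") & (cars["body-style"] == "sedan")]
print(four_door_sedans)
```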

### 3. Missing Values Handling

Replace all '?' with NULL.

In [49]:
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [50]:
df.replace("?", np.nan, inplace=True)
df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


Check how many missing values are in each column.

In [51]:
df.isnull()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
201,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
202,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
203,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [52]:
df.isnull().sum()

symboling             0
normalized-losses    41
make                  0
fuel-type             0
aspiration            0
num-of-doors          2
body-style            0
drive-wheels          0
engine-location       0
wheel-base            0
length                0
width                 0
height                0
curb-weight           0
engine-type           0
num-of-cylinders      0
engine-size           0
fuel-system           0
bore                  4
stroke                4
compression-ratio     0
horsepower            2
peak-rpm              2
city-mpg              0
highway-mpg           0
price                 4
dtype: int64

Replace missing values of normalized-losses, stroke, bore, horsepower with mean.

In [53]:
cols_with_mean = ["normalized-losses", "stroke", "bore", "horsepower"]

for col in cols_with_mean:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype('float64')
    df.fillna({col:df[col].mean()}, inplace=True)

df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000,21,27,13495
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000,21,27,16500
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500,18,22,17450


In [54]:
df.isnull().sum()

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         2
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             2
city-mpg             0
highway-mpg          0
price                4
dtype: int64
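The loop above does two things per column: it converts the strings to numbers, then fills the gaps with the column mean. The sketch below shows both steps on a tiny hypothetical column so the effect of errors='coerce' is visible: values that cannot be parsed become NaN and are excluded from the mean.

```python
import pandas as pd

# Hypothetical column stored as strings, with '?' marking a missing value.
toy = pd.DataFrame({"horsepower": ["111", "?", "154", "102"]})

toy["horsepower"] = pd.to_numeric(toy["horsepower"], errors="coerce")  # '?' becomes NaN
mean_hp = toy["horsepower"].mean()                                     # NaN is skipped
toy["horsepower"] = toy["horsepower"].fillna(mean_hp)

print(toy)
```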

Drop rows with missing price.

In [55]:
df.shape

(205, 26)

In [56]:
df["price"] = pd.to_numeric(df["price"], errors='coerce')
df.dropna(subset=["price"], inplace=True)
df.shape

(201, 26)

Replace missing values of num-of-doors with mode.

In [57]:
df["num-of-doors"].mode()

0    four
Name: num-of-doors, dtype: object

In [58]:
df.fillna({"num-of-doors": df["num-of-doors"].mode()[0]}, inplace=True)
df["num-of-doors"].isnull().sum()

np.int64(0)

Replace other missing values with median.

In [59]:
df["peak-rpm"] = pd.to_numeric(df["peak-rpm"], errors="coerce").astype('float64')
df.fillna({"peak-rpm":df["peak-rpm"].median()}, inplace=True)
df["peak-rpm"].isnull().sum()

np.int64(0)

Confirm there are no more missing values.

In [60]:
print(df.isnull().sum())

symboling            0
normalized-losses    0
make                 0
fuel-type            0
aspiration           0
num-of-doors         0
body-style           0
drive-wheels         0
engine-location      0
wheel-base           0
length               0
width                0
height               0
curb-weight          0
engine-type          0
num-of-cylinders     0
engine-size          0
fuel-system          0
bore                 0
stroke               0
compression-ratio    0
horsepower           0
peak-rpm             0
city-mpg             0
highway-mpg          0
price                0
dtype: int64


### 4. Grouping, Sorting, and Aggregation

Group by fuel-type and compute average price.

In [61]:
df["price"] = pd.to_numeric(df["price"], errors='coerce').astype('float64')
avg_price_by_fuel = df.groupby("fuel-type")["price"].mean()
print(avg_price_by_fuel)

fuel-type
diesel    15838.15000
gas       12916.40884
Name: price, dtype: float64
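A group mean is easier to interpret next to the group size. The agg method can compute several statistics per group in one call. A sketch on hypothetical prices, not the real dataset:

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "fuel-type": ["gas", "gas", "diesel", "diesel"],
    "price": [10000.0, 14000.0, 20000.0, 22000.0],
})

# One row per fuel type, one column per statistic.
summary = toy.groupby("fuel-type")["price"].agg(["mean", "count"])
print(summary)
```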


Convert mpg columns to L/100km.  
  
In our dataset, the fuel consumption columns "city-mpg" and "highway-mpg" are expressed in mpg (miles per gallon).  
Assume we are developing an application for a country that reports fuel consumption in L/100km.  
We therefore need a data transformation that converts mpg into L/100km.

The formula for the unit conversion is:  
L/100km = 235 / mpg

In [62]:
df["city-L/100km"] = 235 / pd.to_numeric(df["city-mpg"], errors='coerce').astype('float64')
df["highway-L/100km"] = 235 / pd.to_numeric(df["highway-mpg"], errors='coerce').astype('float64')

df["city-L/100km"] = df["city-L/100km"].round(2)
df["highway-L/100km"] = df["highway-L/100km"].round(2)

df.head()

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price,city-L/100km,highway-L/100km
0,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0,11.19,8.7
1,3,122.0,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0,11.19,8.7
2,1,122.0,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0,12.37,9.04
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0,9.79,7.83
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0,13.06,10.68


Sort DataFrame by price in descending order.

In [63]:
df_sorted = df.sort_values(by="price", ascending=False)
df_sorted[["make", "fuel-type", "price"]].head()

Unnamed: 0,make,fuel-type,price
74,mercedes-benz,gas,45400.0
16,bmw,gas,41315.0
73,mercedes-benz,gas,40960.0
128,porsche,gas,37028.0
17,bmw,gas,36880.0
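When only the top few rows are needed, DataFrame.nlargest is a possible shortcut for sorting in descending order and taking the head. A sketch on hypothetical rows named toy:

```python
import pandas as pd

# Hypothetical data, for illustration only.
toy = pd.DataFrame({
    "make": ["bmw", "audi", "volvo", "porsche"],
    "price": [41315.0, 13950.0, 16845.0, 37028.0],
})

top2 = toy.nlargest(2, "price")                             # two most expensive rows
by_sort = toy.sort_values(by="price", ascending=False).head(2)

print(top2)
```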


## Knowledge Check

**1. What is the difference between a NumPy array and a Pandas Series? When would you prefer one over the other in a machine learning pipeline?**

A NumPy array is a fast, memory-efficient container for homogeneous (usually numeric) data that supports vectorized operations. It does not carry labels or metadata. A Pandas Series, on the other hand, is a one-dimensional labeled array capable of holding any data type. NumPy is better suited to numerical computation and to dense feature matrices in machine learning pipelines, while a Pandas Series is preferred for labeled data such as time series, target variables, or data that requires indexing and metadata. In practice, NumPy arrays are often used for model training inputs (X), and Series are used for labels (y) or categorical features.

**2. Consider the following code:**  
arr = np.array([[10, 20], [30, 40]])    
print(arr[:, 1])  
**What will be the output and why? Explain in terms of slicing and indexing**

The output will be the **second column** of the matrix, which prints as **[20 40]**.  
The indexing follows the pattern arr[rows, cols], where each position can be an integer, a slice, or a list of indices.  
The symbol ":" is a slice that selects all rows, and 1 is an integer index that selects the column at position 1. Because NumPy indexing is zero-based, position 1 is the second column.  
Since the column is picked with an integer rather than a slice, that axis is dropped and the result is a 1D array.

Indexing retrieves a particular element (or one position along an axis) from the array, while slicing extracts a sub-array covering a range of positions.
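The difference between an integer index and a slice shows up in the shape of the result. The sketch below selects the same column both ways:

```python
import numpy as np

arr = np.array([[10, 20], [30, 40]])

col_indexed = arr[:, 1]     # integer index on axis 1: that axis is dropped
col_sliced = arr[:, 1:2]    # slice on axis 1: the axis is kept with length 1

print(col_indexed, col_indexed.shape)
print(col_sliced, col_sliced.shape)
```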

**3. How can you replace all missing values in a Pandas DataFrame column with the median of that column? Give one example line of code.**

We can replace all missing values in a column using fillna (short for "fill NA"), passing the column's median as the fill value.  
df["column_name"] = df["column_name"].fillna(df["column_name"].median())

**4. What is the significance of .describe() and .info() functions in data exploration? Mention one key difference between them.**

The .info() function provides a concise summary of the DataFrame structure, including column names, non-null counts, and data types. It is especially useful for identifying missing values and understanding the schema of your dataset. The .describe() function, in contrast, provides descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) and by default covers only the numeric columns. One key difference between the two is that .info() focuses on data structure and memory usage, while .describe() focuses on the statistical distribution of the data. Both are essential tools for quickly exploring and understanding a dataset.

**5. You are given a column city-mpg in a dataset. How would you convert it to L/100km using Pandas? Also, explain what this transformation means practically.**

To convert city-mpg to the fuel consumption format used in many countries (liters per 100 kilometers, or L/100km), you use the formula L/100km = 235 / mpg. This transformation inverts the logic of fuel efficiency: in mpg, higher values are better (more miles per gallon), whereas in L/100km, lower values are better (less fuel used per 100 kilometers).  
Here's how you can implement the conversion using Pandas:  
df["city-L/100km"] = 235 / pd.to_numeric(df["city-mpg"], errors='coerce')  
This transformation makes the values directly comparable in regions such as Europe and Canada, where L/100km is the customary unit for fuel consumption.