# DS Programming Tutorial & Practice

Welcome to this **Data Science Programming Tutorial** notebook!

In this notebook, we will cover:
1. **Reading data files** (CSV, JSON)
2. **Viewing and retrieving data** (head, tail, specific indices)
3. **Basic data manipulation** (insertion, deletion)
4. **Dataset summaries** (types, shape, info)
5. **Data preprocessing** (missing values, duplicates, sampling, binning)

You will also find **Practice Problems** with **Solution Examples** so that you can practice each concept immediately after learning it. Feel free to modify the data paths, columns, or values to match your own datasets.

## 1. Importing Modules and Reading Files

Below, we import `pandas` (the main Python library for data manipulation), then load a CSV file and a JSON file.
- Replace the file paths (`"./data.csv"` and `"./data.json"`) with the paths to your own files if necessary.

In [2]:
import pandas as pd  # Import the pandas library

# Load a CSV file from a specified path.
# Make sure the path is correct, or else it will raise a FileNotFoundError.
df_csv = pd.read_csv("./data.csv")

# Attempt to read a JSON file.
# Make sure the file exists and is valid JSON.
df_json = pd.read_json("./data.json")
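
To try these two calls without any files on disk, here is a minimal sketch that reads the same formats from in-memory text. The sample rows and the names `csv_text`, `json_text`, `demo_csv`, and `demo_json` are made up for illustration; the only assumption about pandas is that `read_csv` and `read_json` accept file-like objects as well as paths.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data standing in for data.csv and data.json.
csv_text = "Make,Year,MSRP\nBMW,2011,46135\nAcura,2012,46120\n"
json_text = '[{"Make": "BMW", "Year": 2011}, {"Make": "Acura", "Year": 2012}]'

# StringIO wraps a string so pandas can treat it like an open file.
demo_csv = pd.read_csv(StringIO(csv_text))
demo_json = pd.read_json(StringIO(json_text))

print(demo_csv.shape)    # (rows, columns)
print(demo_json.dtypes)  # one dtype per column
```

Swapping `StringIO(...)` for a path string gives the file-based form used in the cell above.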


In [4]:
# Check data types
df_csv.dtypes

Make                  object
Model                 object
Year                   int64
Engine Fuel Type      object
Engine HP            float64
Engine Cylinders     float64
Transmission Type     object
Driven_Wheels         object
Number of Doors      float64
Market Category       object
Vehicle Size          object
Vehicle Style         object
highway MPG            int64
city mpg               int64
Popularity             int64
MSRP                   int64
dtype: object

In [5]:
# Check (rows, columns)
df_csv.shape

(11914, 16)

In [6]:
# Print info
df_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11914 entries, 0 to 11913
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Make               11914 non-null  object 
 1   Model              11914 non-null  object 
 2   Year               11914 non-null  int64  
 3   Engine Fuel Type   11911 non-null  object 
 4   Engine HP          11845 non-null  float64
 5   Engine Cylinders   11884 non-null  float64
 6   Transmission Type  11914 non-null  object 
 7   Driven_Wheels      11914 non-null  object 
 8   Number of Doors    11908 non-null  float64
 9   Market Category    8172 non-null   object 
 10  Vehicle Size       11914 non-null  object 
 11  Vehicle Style      11914 non-null  object 
 12  highway MPG        11914 non-null  int64  
 13  city mpg           11914 non-null  int64  
 14  Popularity         11914 non-null  int64  
 15  MSRP               11914 non-null  int64  
dtypes: float64(3), int64(5), object(8)
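
The `Non-Null Count` column reported by `info()` is the row count minus the number of missing values in each column. As a small sketch on a made-up frame (`demo` and its values are hypothetical), the missing counts can also be computed directly with `isna().sum()`:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing horsepower value.
demo = pd.DataFrame({
    "Make": ["BMW", "Audi", "Acura"],
    "Engine HP": [335.0, np.nan, 300.0],
})

missing_per_column = demo.isna().sum()     # number of NaN values per column
non_null_per_column = demo.notna().sum()   # the counts that info() reports

print(missing_per_column)
print(non_null_per_column)
```

Applied to the output above, `Market Category` has 8172 non-null entries out of 11914 rows, so 11914 - 8172 = 3742 values are missing.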

## 2. Viewing Rows / Specific Indices

After loading, we display the **first few rows** using `head()`. It shows 5 rows unless you pass a different number.

In [7]:
# Show the first few rows of the DataFrame.
# head() defaults to 5 rows, but you can specify head(n) to see n.
df_csv.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500



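To see the **end** of the dataset, we use the `.tail()` method. We can also retrieve **specific rows** with `.loc[]`, which selects by index label. Because this DataFrame has the default `RangeIndex`, its labels are the integers 0 through 11913; use `.iloc[]` when you want to select by position instead.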

In [8]:
# Show the last 5 rows of df_csv.
# This is helpful to see if your dataset has a footer or extra lines.
df_csv.tail()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
11909,Acura,ZDX,2012,premium unleaded (required),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,46120
11910,Acura,ZDX,2012,premium unleaded (required),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,56670
11911,Acura,ZDX,2012,premium unleaded (required),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,50620
11912,Acura,ZDX,2013,premium unleaded (recommended),300.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Hatchback,Luxury",Midsize,4dr Hatchback,23,16,204,50920
11913,Lincoln,Zephyr,2006,regular unleaded,221.0,6.0,AUTOMATIC,front wheel drive,4.0,Luxury,Midsize,Sedan,26,17,61,28995


For specific indices:

In [9]:
# Select rows by index label. .loc raises a KeyError if any of these labels is missing.
indices_to_view = [1, 4, 7]

df_csv.loc[indices_to_view]

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500
7,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,39300
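
The difference between label-based and position-based selection only becomes visible when the index is not 0, 1, 2, and so on. The sketch below uses a made-up frame (`demo`, with hypothetical labels 10, 20, 30) to show `.loc[]` next to `.iloc[]`:

```python
import pandas as pd

# Hypothetical frame whose index labels are not 0, 1, 2, ...
demo = pd.DataFrame(
    {"Make": ["BMW", "Audi", "Acura"], "Year": [2011, 1992, 2012]},
    index=[10, 20, 30],
)

by_label = demo.loc[[10, 30]]     # rows whose index labels are 10 and 30
by_position = demo.iloc[[0, 2]]   # first and third rows, whatever their labels

print(by_label.equals(by_position))

# Asking .loc for a label that is not in the index raises a KeyError.
try:
    demo.loc[[0]]
except KeyError as err:
    print("KeyError:", err)
```

Both selections return the same two rows here, but only because labels 10 and 30 happen to sit at positions 0 and 2.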


### Practice Problem (2)
1. Display the last 3 rows.
2. Select specific rows (6, 9, 11) by index.

## 3. Inserting (Appending) a New Row

We can add new rows to an existing DataFrame.

In older versions of pandas, `DataFrame.append()` was the usual way to do this, but it was deprecated in pandas 1.4 and removed in pandas 2.0.

Instead, we use `pd.concat()`.

In [10]:
# Create a dictionary that represents a new row.
new_row = {
    'Make': 'Hyundai',
    'Model': 141,
    'Year': 2025,
    'Engine Fuel Type': 'regular unleaded',
    'Engine HP': 217,
    'Engine Cylinders': 5,
    'Transmission Type': 'MANUAL',
    'Driven_Wheels': 'all wheel drive',
    'Number of Doors': 4,  # numeric, so the column keeps its float64 dtype
    'Market Category': 'luxury',
    'Vehicle Size': 'middle',
    'Vehicle Style': 'sedan',
    'highway MPG': 22,
    'city mpg': 16,
    'Popularity': 3105,
    'MSRP': 2000
}

In [11]:
# Convert the dictionary to a DataFrame with one row.
temp_df = pd.DataFrame([new_row])

# Concatenate (stack) the new row onto our existing DataFrame.
# ignore_index=True renumbers the result from 0 instead of keeping each frame's own labels.
df_csv = pd.concat([df_csv, temp_df], ignore_index=True)

In [16]:
df_csv.tail(1)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
11914,Hyundai,141,2025,regular unleaded,217.0,5.0,MANUAL,all wheel drive,4.0,luxury,middle,sedan,22,16,3105,2000
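
Two things are worth checking after a concat: the index labels and the column dtypes. The following is a small sketch on made-up frames (`base`, `extra`, and `as_text` are hypothetical) showing what `ignore_index=True` changes, and how a string value in a numeric column affects the dtype:

```python
import pandas as pd

base = pd.DataFrame({"Make": ["BMW", "Audi"], "Number of Doors": [2.0, 4.0]})
extra = pd.DataFrame([{"Make": "Hyundai", "Number of Doors": 4}])

kept = pd.concat([base, extra])                            # labels 0, 1, 0
renumbered = pd.concat([base, extra], ignore_index=True)   # labels 0, 1, 2

print(list(kept.index))
print(list(renumbered.index))

# A string in a numeric column means the result is no longer float64.
as_text = pd.DataFrame([{"Make": "Hyundai", "Number of Doors": "4"}])
mixed = pd.concat([base, as_text], ignore_index=True)
print(renumbered["Number of Doors"].dtype, mixed["Number of Doors"].dtype)
```

Without `ignore_index=True`, the duplicated label 0 would make `kept.loc[0]` return two rows, which is rarely what you want.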


### Practice Problem (3)
1. Create another dictionary representing a new row. Set `'Make'` to `'Tesla'`, `'Model'` to `3`, `'Year'` to `2025`, `'Engine Fuel Type'` to `'electric'`, and `'MSRP'` to `80000`. Fill any unknown values with `'NA'` (for text columns) or `0` (for numeric columns).
2. Append it to the DataFrame.
3. Display **the last row** to confirm it was added.

## 5. Deleting Rows

We use `.drop()` to remove rows by index label. With `inplace=True`, it modifies the DataFrame directly and returns `None`. Otherwise, it returns a **new** DataFrame with the rows removed and leaves the original unchanged.

In [17]:
# Delete the last row in-place.
df_csv.drop(df_csv.index[-1], inplace=True)

In [18]:
df_csv.tail(1)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
11913,Lincoln,Zephyr,2006,regular unleaded,221.0,6.0,AUTOMATIC,front wheel drive,4.0,Luxury,Midsize,Sedan,26,17,61,28995


In [19]:
df_csv.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


In [20]:
# Delete multiple rows: the rows at positions 0 and 2, for example.
indices_to_drop = [0, 2]
# Keep only positions that fall inside the DataFrame, then look up their index labels.
valid_drop_indices = [idx for idx in indices_to_drop if idx < len(df_csv)]
df_csv.drop(df_csv.index[valid_drop_indices], inplace=True)

# Show the first 5 rows to see the result.
df_csv.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500
5,BMW,1 Series,2012,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,31200
6,BMW,1 Series,2012,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,26,17,3916,44100
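
The variants of `.drop()` described in this section can be compared side by side. This is a sketch on a made-up four-row frame (`demo` is hypothetical); it also uses `errors="ignore"` and `reset_index(drop=True)`, which the cells above do not use:

```python
import pandas as pd

demo = pd.DataFrame({"Make": ["BMW", "Audi", "Acura", "Lincoln"]})

# Without inplace=True, drop() returns a new DataFrame and leaves demo unchanged.
trimmed = demo.drop([0, 2])
print(list(demo.index), list(trimmed.index))

# errors="ignore" skips labels that are not present instead of raising a KeyError.
safe = demo.drop([1, 99], errors="ignore")

# reset_index(drop=True) renumbers the remaining rows from 0.
renumbered = trimmed.reset_index(drop=True)
print(list(safe.index), list(renumbered.index))
```

Note that after a drop the index keeps its gaps (1, 3, 4, ... in the output above) unless you renumber it explicitly.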


### Practice Problem (5)
1. Choose **two or three** indices from your DataFrame and drop them.
2. Confirm by selecting those rows again with `.loc[]`; a `KeyError` indicates they are gone.

## 6. Finding Min, Max, Mean, Median, and Mode

In this section, we will explore how to:
- Find the **minimum** and **maximum** values of a specific column.
- Compute the **mean**, **median**, and **mode**.
- Retrieve the rows corresponding to these statistical values.


### Finding Min and Max Values
We can use `.min()` and `.max()` to find the lowest and highest values in a specific column.

In [21]:
column_name = 'MSRP'  # Change to any numerical column
min_value = df_csv[column_name].min()
max_value = df_csv[column_name].max()
print(f'Min {column_name}:', min_value)
print(f'Max {column_name}:', max_value)

Min MSRP: 2000
Max MSRP: 2065902


### Retrieving Rows with Min and Max Values
To find the rows where the **minimum** or **maximum** values occur, we use boolean indexing.

In [22]:
min_row = df_csv[df_csv[column_name] == min_value]
max_row = df_csv[df_csv[column_name] == max_value]
display(min_row)
display(max_row)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
17,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,24,17,3105,2000
18,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,24,17,3105,2000
19,Audi,100,1992,regular unleaded,172.0,6.0,AUTOMATIC,all wheel drive,4.0,Luxury,Midsize,Wagon,20,16,3105,2000
20,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,24,17,3105,2000
21,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,all wheel drive,4.0,Luxury,Midsize,Sedan,21,16,3105,2000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11481,Suzuki,X-90,1998,regular unleaded,95.0,4.0,MANUAL,four wheel drive,2.0,,Compact,2dr SUV,26,22,481,2000
11482,Suzuki,X-90,1998,regular unleaded,95.0,4.0,MANUAL,rear wheel drive,2.0,,Compact,2dr SUV,26,22,481,2000
11792,Subaru,XT,1991,regular unleaded,97.0,4.0,MANUAL,front wheel drive,2.0,,Compact,Coupe,29,22,640,2000
11793,Subaru,XT,1991,regular unleaded,145.0,6.0,AUTOMATIC,front wheel drive,2.0,,Compact,Coupe,26,18,640,2000


Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
11362,Bugatti,Veyron 16.4,2008,premium unleaded (required),1001.0,16.0,AUTOMATED_MANUAL,all wheel drive,2.0,"Exotic,High-Performance",Compact,Coupe,14,8,820,2065902
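
An alternative to boolean indexing is `idxmin()` / `idxmax()`, which return the index label of the first extreme value only. The sketch below uses a made-up three-row frame (`demo` is hypothetical, with prices borrowed from the output above) to contrast the two approaches when there are ties:

```python
import pandas as pd

demo = pd.DataFrame(
    {"Make": ["Audi", "Suzuki", "Bugatti"], "MSRP": [2000, 2000, 2065902]}
)

# idxmin()/idxmax() return the index label of the first extreme value.
cheapest_label = demo["MSRP"].idxmin()
priciest_label = demo["MSRP"].idxmax()

print(demo.loc[priciest_label, "Make"])
print(cheapest_label)

# Boolean indexing, by contrast, keeps every tied row.
all_cheapest = demo[demo["MSRP"] == demo["MSRP"].min()]
print(len(all_cheapest))
```

Use boolean indexing when you need every tied row, as with the many cars priced at 2000 above, and `idxmin()` / `idxmax()` when a single row is enough.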


### Calculating Mean, Median, and Mode
- **Mean**: The average value of a numerical column.
- **Median**: The middle value of the sorted data (the average of the two middle values when the count is even).
- **Mode**: The most frequently occurring value (there can be more than one).

In [23]:
mean_value = df_csv[column_name].mean()
median_value = df_csv[column_name].median()
mode_value = df_csv[column_name].mode()[0]  # mode() returns a Series because ties are possible; take the first (smallest) value

print(f'Mean {column_name}:', mean_value)
print(f'Median {column_name}:', median_value)
print(f'Mode {column_name}:', mode_value)

Mean MSRP: 40594.6282740094
Median MSRP: 29995.0
Mode MSRP: 2000
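
To see how the three measures react to skewed data, here is a small worked example on made-up numbers (`prices` and `tied` are hypothetical Series, not taken from the dataset):

```python
import pandas as pd

prices = pd.Series([2000, 2000, 30000, 40000, 500000])

print(prices.mean())     # pulled upward by the single very large value
print(prices.median())   # middle of the sorted values
print(prices.mode())     # a Series, because ties are possible

# With a tie, mode() returns every tied value in ascending order.
tied = pd.Series([3, 1, 3, 1, 2])
print(tied.mode().tolist())
```

A mean well above the median, as in the MSRP output above (about 40595 versus 29995), is consistent with a few very expensive cars pulling the average up.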


### Retrieving Rows with Mode Value
Since the mode is the most frequently occurring value, multiple rows may match. Here the mode equals the minimum (2000), so these are the same rows as `min_row` above.

In [25]:
mode_rows = df_csv[df_csv[column_name] == mode_value]
display(mode_rows)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
17,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,24,17,3105,2000
18,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,24,17,3105,2000
19,Audi,100,1992,regular unleaded,172.0,6.0,AUTOMATIC,all wheel drive,4.0,Luxury,Midsize,Wagon,20,16,3105,2000
20,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,front wheel drive,4.0,Luxury,Midsize,Sedan,24,17,3105,2000
21,Audi,100,1992,regular unleaded,172.0,6.0,MANUAL,all wheel drive,4.0,Luxury,Midsize,Sedan,21,16,3105,2000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11481,Suzuki,X-90,1998,regular unleaded,95.0,4.0,MANUAL,four wheel drive,2.0,,Compact,2dr SUV,26,22,481,2000
11482,Suzuki,X-90,1998,regular unleaded,95.0,4.0,MANUAL,rear wheel drive,2.0,,Compact,2dr SUV,26,22,481,2000
11792,Subaru,XT,1991,regular unleaded,97.0,4.0,MANUAL,front wheel drive,2.0,,Compact,Coupe,29,22,640,2000
11793,Subaru,XT,1991,regular unleaded,145.0,6.0,AUTOMATIC,front wheel drive,2.0,,Compact,Coupe,26,18,640,2000


### Practice Problems

1. Find the **minimum and maximum values** for the column `'Year'` and display the corresponding rows.

2. Calculate the **mean, median, and mode** for the column `'Engine Fuel Type'`. What happens if it's a non-numerical column?

3. Identify the most frequently occurring **Make** in the dataset and display all rows with that `Make`.

In [None]:
# Write your answers here
# Example: df_csv['Year'].min(), df_csv['Year'].max()
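
One possible approach to problems 2 and 3, sketched on a made-up frame rather than the real dataset (`demo` and its values are hypothetical). It assumes that pandas raises an exception when asked for the mean of a text column, which is why the call is wrapped in `try`/`except`:

```python
import pandas as pd

demo = pd.DataFrame({
    "Make": ["BMW", "Audi", "BMW", "Acura"],
    "Engine Fuel Type": ["premium", "regular", "premium", "premium"],
})

# mode() works on text columns; mean() is not defined for strings.
fuel_mode = demo["Engine Fuel Type"].mode()[0]
try:
    demo["Engine Fuel Type"].mean()
    mean_failed = False
except (TypeError, ValueError):
    mean_failed = True

# value_counts() counts each value, so idxmax() gives the most common Make.
top_make = demo["Make"].value_counts().idxmax()
top_make_rows = demo[demo["Make"] == top_make]

print(fuel_mode, mean_failed, top_make, len(top_make_rows))
```

The same pattern should carry over to `df_csv` by swapping in the real column names.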


### Calculating Dispersion (Variance, Standard Deviation)
Dispersion measures indicate how spread out the data is:
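- **Variance**: The average squared deviation from the mean. By default, pandas' `.var()` computes the sample variance (`ddof=1`), dividing by n - 1 rather than n.
- **Standard Deviation**: The square root of the variance, expressed in the same units as the data.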

In [26]:
variance_value = df_csv[column_name].var()
std_dev_value = df_csv[column_name].std()

print(f'Variance of {column_name}:', variance_value)
print(f'Standard Deviation of {column_name}:', std_dev_value)

Variance of MSRP: 3613706929.9585266
Standard Deviation of MSRP: 60114.11589600671
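
The relationship between these quantities can be checked by hand on a small made-up Series (`values` is hypothetical). The sketch compares the default sample variance with the population variance obtained by passing `ddof=0`:

```python
import math

import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = values.var()             # ddof=1: divides by n - 1
population_var = values.var(ddof=0)   # divides by n
sample_std = values.std()             # square root of the sample variance

print(sample_var, population_var, sample_std)
```

Here the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7. On a dataset with thousands of rows the two differ very little.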


### Practice Problems

1. Compute the **variance and standard deviation** for the column `'MSRP'`. What does the result tell you about the spread of values?

In [24]:
# Write your answers here
# Example: df_csv['MSRP'].var(), df_csv['MSRP'].std()
