# Topic 1 – Reading Data

In this notebook we focus on **reading and understanding data**.

The goal is to:
- load a real dataset from a CSV file
- inspect its structure
- understand observations, features, and missing values

No modeling or prediction is performed here.


## What Does Reading Data Mean?

Reading data means:
- loading data from a file into Python
- representing it in a structured form
- understanding its rows, columns, and values

In this course, data is usually read from **CSV files**
and stored in **pandas DataFrames**.


## Import Required Libraries

In [1]:
import pandas as pd
import numpy as np

## Loading a CSV File

We now load a dataset from a CSV file.

The dataset is stored in the `datasets/raw/Topic1/` folder. The path in the next cell is relative to the notebook's working directory.


In [2]:
df = pd.read_csv("../datasets/raw/Topic1/ILPD.csv")
df

Unnamed: 0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
0,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
1,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
2,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
3,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
4,46,Male,1.8,0.7,208,19,14,7.6,4.4,1.30,1
...,...,...,...,...,...,...,...,...,...,...,...
577,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
578,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
579,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
580,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1
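
The column names in the output above (`65`, `Female`, `0.7`, ...) look like data values. This suggests that the file has no header row, so pandas used the first record as the header. A minimal sketch of one way to handle this, assuming the file really has no header row: pass `header=None` and supply column names. The names below are hypothetical placeholders, and the two-row sample only stands in for the real file.

```python
import io
import pandas as pd

# Two sample rows copied from the output above; they stand in for the real file.
sample = io.StringIO(
    "65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1\n"
    "62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1\n"
)

# Hypothetical column names, chosen only for illustration.
column_names = ["age", "gender"] + [f"feature_{i}" for i in range(1, 9)] + ["label"]

# header=None tells pandas that the first line is data, not column names.
df_named = pd.read_csv(sample, header=None, names=column_names)

print(df_named.shape)
print(df_named.columns.tolist())
```

With `header=None`, the first record stays in the data instead of becoming the header, so the row count is one higher than with the default setting.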


## Reading TXT Files

Datasets are sometimes stored in TXT files instead of CSV files.

TXT files usually contain tabular data where values are separated
by a specific **delimiter**.

Common delimiters:
- comma (,)
- semicolon (;)
- tab (\t)
- pipe (|)


In [3]:
df_txt = pd.read_csv(
    "../datasets/raw/Topic1/AutoInsurSweden.txt",
    sep=r"\s+",      # split on whitespace (spaces or tabs)
    decimal=","      # comma is decimal separator
)
df_txt.head()

# If the data does not load correctly, the delimiter may be wrong.
# We can try different delimiters until the structure is correct.

Unnamed: 0,X,Y
0,108,392.5
1,19,46.2
2,13,15.7
3,124,422.2
4,40,119.4
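
Trying different delimiters can be done in a small loop. The sketch below uses a hypothetical semicolon-separated sample (not the real file) and counts how many columns each candidate delimiter produces. Picking the delimiter with the most columns is only a rough heuristic, so the result should still be checked by eye.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated sample; it stands in for an unknown TXT file.
raw_text = "X;Y\n108;392.5\n19;46.2\n13;15.7\n"

candidates = [",", ";", "\t", "|"]
column_counts = {}
for sep in candidates:
    trial = pd.read_csv(io.StringIO(raw_text), sep=sep)
    column_counts[sep] = trial.shape[1]  # number of columns found with this delimiter

# Rough heuristic: the delimiter that yields the most columns is the likely one.
best_sep = max(column_counts, key=column_counts.get)
df_ok = pd.read_csv(io.StringIO(raw_text), sep=best_sep)

print(column_counts)
print(df_ok.head())
```

A wrong delimiter typically shows up as a single column whose name contains the whole header line.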


## Inspecting the First Rows

To quickly understand a dataset, we usually look at its first few rows.


In [4]:
df.head()

# This shows:
# - column names
# - example values
# - general structure

Unnamed: 0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
0,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
1,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
2,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.0,1
3,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.4,1
4,46,Male,1.8,0.7,208,19,14,7.6,4.4,1.3,1


## Dataset Dimensions

We now check:
- how many observations (rows) the dataset has
- how many features (columns) it contains

In [5]:
# Number of Rows and Columns

df.shape

# (number_of_rows, number_of_columns)

(582, 11)
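
Since `shape` is a plain tuple, it can be unpacked into two variables. A short sketch on a small made-up DataFrame (the values are illustrative only):

```python
import pandas as pd

# Small made-up DataFrame used only for illustration.
df_demo = pd.DataFrame({
    "age": [62, 58, 72],
    "gender": ["Male", "Male", "Female"],
})

n_rows, n_cols = df_demo.shape  # (number_of_rows, number_of_columns)
print(f"{n_rows} observations, {n_cols} features")

# len() on a DataFrame returns the number of rows.
print(len(df_demo))
```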

## Columns and Data Types

Each column has:
- a name
- a data type (numeric, text, etc.)

Understanding data types is essential before analysis.


In [6]:
df.info()

# This shows:
# - column names
# - data types
# - number of non-missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 582 entries, 0 to 581
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   65      582 non-null    int64  
 1   Female  582 non-null    object 
 2   0.7     582 non-null    float64
 3   0.1     582 non-null    float64
 4   187     582 non-null    int64  
 5   16      582 non-null    int64  
 6   18      582 non-null    int64  
 7   6.8     582 non-null    float64
 8   3.3     582 non-null    float64
 9   0.9     578 non-null    float64
 10  1       582 non-null    int64  
dtypes: float64(5), int64(5), object(1)
memory usage: 50.1+ KB
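
Data types can also be used to pick out columns. The sketch below, on a small made-up DataFrame, separates numeric columns from non-numeric ones with `select_dtypes`. The column names are illustrative assumptions.

```python
import pandas as pd

# Small made-up DataFrame used only for illustration.
df_demo = pd.DataFrame({
    "age": [62, 58, 72],
    "gender": ["Male", "Male", "Female"],
    "ratio": [0.74, 1.00, 0.40],
})

print(df_demo.dtypes)  # one data type per column

# Split the column names by data type.
numeric_cols = df_demo.select_dtypes(include="number").columns.tolist()
other_cols = df_demo.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("non-numeric:", other_cols)
```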


## Missing Values

Real-world datasets often contain missing values.

In pandas, missing numeric values are represented as **NaN** ("Not a Number").


In [7]:
df.isna().sum()

65        0
Female    0
0.7       0
0.1       0
187       0
16        0
18        0
6.8       0
3.3       0
0.9       4
1         0
dtype: int64
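
Besides counting missing values per column, it often helps to see their share and the affected rows. A minimal sketch on a small made-up DataFrame (values are illustrative only):

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame with two missing values in one column.
df_demo = pd.DataFrame({
    "age": [62, 58, 72, 46],
    "ratio": [0.74, np.nan, 0.40, np.nan],
})

# Fraction of missing values per column (mean of True/False flags).
missing_share = df_demo.isna().mean()

# Rows that contain at least one missing value.
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]

print(missing_share)
print(rows_with_missing)
```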

## Dropping Rows with Missing Values

One simple way to handle missing values is to **remove observations**
that contain missing data.

This means:
- each row with at least one NaN value is removed
- the dataset becomes smaller but cleaner

Dropping rows is appropriate when:
- the number of missing values is small
- the dataset is large enough


In [8]:
df_clean = df.dropna()
df_clean

Unnamed: 0,65,Female,0.7,0.1,187,16,18,6.8,3.3,0.9,1
0,62,Male,10.9,5.5,699,64,100,7.5,3.2,0.74,1
1,62,Male,7.3,4.1,490,60,68,7.0,3.3,0.89,1
2,58,Male,1.0,0.4,182,14,20,6.8,3.4,1.00,1
3,72,Male,3.9,2.0,195,27,59,7.3,2.4,0.40,1
4,46,Male,1.8,0.7,208,19,14,7.6,4.4,1.30,1
...,...,...,...,...,...,...,...,...,...,...,...
577,60,Male,0.5,0.1,500,20,34,5.9,1.6,0.37,2
578,40,Male,0.6,0.1,98,35,31,6.0,3.2,1.10,1
579,52,Male,0.8,0.2,245,48,49,6.4,3.2,1.00,1
580,31,Male,1.3,0.5,184,29,32,6.8,3.4,1.00,1
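
`dropna()` returns a new DataFrame and leaves the original unchanged. The sketch below, on a small made-up DataFrame, also shows two common variations: restricting the check to selected columns with `subset`, and renumbering the rows afterwards with `reset_index`.

```python
import numpy as np
import pandas as pd

# Small made-up DataFrame with one missing value in each column.
df_demo = pd.DataFrame({
    "age": [62, 58, 72, np.nan],
    "ratio": [0.74, np.nan, 0.40, 1.30],
})

# Drop rows with a missing value in any column.
drop_any = df_demo.dropna()

# Drop rows only if "ratio" is missing; a missing "age" is tolerated.
drop_ratio_only = df_demo.dropna(subset=["ratio"])

# The original row labels are kept, so the index can have gaps.
drop_any_reset = drop_any.reset_index(drop=True)

print(drop_any.index.tolist())
print(drop_ratio_only.index.tolist())
print(drop_any_reset.index.tolist())
```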


## Effect of Dropping Rows

We compare the size of the dataset before and after dropping rows with missing values.


In [9]:
print("Original dataset shape:", df.shape)
print("Cleaned dataset shape:", df_clean.shape)

Original dataset shape: (582, 11)
Cleaned dataset shape: (578, 11)


## Saving a DataFrame to a CSV File

After cleaning or transforming a dataset, it is often useful to
**save the result to a new CSV file**.

This allows us to:
- reuse cleaned data later
- separate raw data from processed data
- ensure reproducibility of the analysis

In [10]:
df_clean.to_csv( # writes the DataFrame to a CSV file
    "../datasets/processed/Topic1/ILPD_clean.csv", # location of saved file
    index=False # avoids saving the row index as a column
)
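
`to_csv` does not create missing folders, so writing fails if the target directory does not exist yet. The sketch below creates the folder first and reads the file back as a quick check. It writes to a temporary directory with a hypothetical file name so that it does not touch the course folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small made-up DataFrame used only for illustration.
df_demo = pd.DataFrame({"age": [62, 58], "ratio": [0.74, 1.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical output path inside a temporary folder.
    out_path = Path(tmp) / "processed" / "Topic1" / "demo_clean.csv"

    # Create the target folder (and its parents) if it does not exist.
    out_path.parent.mkdir(parents=True, exist_ok=True)

    df_demo.to_csv(out_path, index=False)

    # Read the file back to confirm that it was written as expected.
    df_back = pd.read_csv(out_path)

print(df_back.shape)
print(df_back.columns.tolist())
```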

### Why We Save Processed Data Separately

- `datasets/raw/` contains original, unchanged data
- `datasets/processed/` contains data that reflects our analytical decisions (for example, dropped rows)

Saving processed data makes workflows:
- reproducible
- transparent
- easier to continue in later topics


## Observations

Reading TXT files follows the same logic as reading CSV files.
The main difference is specifying the correct delimiter and, where needed, the decimal separator.

Once loaded, TXT data is handled exactly like CSV data
using pandas DataFrames.


## Basic Statistical Overview

Before any modeling, it is useful to look at basic statistics
for numerical columns.


In [11]:
df_clean.describe()

# Includes:
# - count
# - mean
# - standard deviation
# - min / max
# - quartiles (25%, 50%, 75%)

Unnamed: 0,65,0.7,0.1,187,16,18,6.8,3.3,0.9,1
count,578.0,578.0,578.0,578.0,578.0,578.0,578.0,578.0,578.0,578.0
mean,44.747405,3.319896,1.49654,291.546713,81.238754,110.574394,6.481142,3.138235,0.947145,1.285467
std,16.213968,6.232158,2.81834,243.734041,183.321431,290.075539,1.0855,0.795094,0.319863,0.452028
min,4.0,0.4,0.1,63.0,10.0,10.0,2.7,0.9,0.3,1.0
25%,33.0,0.8,0.2,175.25,23.25,25.0,5.8,2.6,0.7,1.0
50%,45.0,1.0,0.3,208.5,35.0,42.0,6.6,3.1,0.94,1.0
75%,58.0,2.6,1.3,298.0,61.0,87.0,7.2,3.8,1.1,2.0
max,90.0,75.0,19.7,2110.0,2000.0,4929.0,9.6,5.5,2.8,2.0
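
By default, `describe()` summarizes only numeric columns, so a text column does not appear in its output. A minimal sketch of how a text column could be summarized separately, using a small made-up DataFrame with illustrative column names:

```python
import pandas as pd

# Small made-up DataFrame with one numeric and one text column.
df_demo = pd.DataFrame({
    "age": [62, 58, 72, 46],
    "gender": ["Male", "Male", "Female", "Male"],
})

# Numeric summary: the text column is left out.
numeric_summary = df_demo.describe()

# Summary of a single text column: count, unique, top, freq.
text_summary = df_demo["gender"].describe()

# Frequency of each category.
gender_counts = df_demo["gender"].value_counts()

print(numeric_summary.columns.tolist())
print(text_summary)
print(gender_counts)
```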


## Observations and Features

- Each row represents one **observation**
- Each column represents one **feature**

Understanding this mapping is essential for regression
and machine learning later.
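
The row and column view can be made concrete with a short sketch. The DataFrame and its column names below are made up for illustration.

```python
import pandas as pd

# Small made-up DataFrame used only for illustration.
df_demo = pd.DataFrame({
    "age": [62, 58, 72],
    "gender": ["Male", "Male", "Female"],
    "label": [1, 1, 2],
})

# One observation: a single row, with one value per feature.
first_observation = df_demo.iloc[0]

# One feature: a single column, with one value per observation.
age_feature = df_demo["age"]

print(first_observation)
print(age_feature)
```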


## What Comes Next?

Now that the data is loaded and understood, we are ready to:

- select variables
- split data into training and test sets
- build regression models

➡️ **Next: Topic 2 – Linear Regression**
