# Tasks 
1. Load a dataset from a URL.
2. Identify columns with missing values and decide whether to fill them or drop them.
3. Convert a categorical column into numerical format (e.g., gender → 0/1).
4. Save the cleaned dataset to a new CSV file.

## Let's start 
For this task I have chosen the following dataset:
| Dataset                | URL (direct CSV)                                                                     | Description                                                             | Source  |
| ---------------------- | ------------------------------------------------------------------------------------ | ----------------------------------------------------------------------- | ------- |
| **Iris**               | `https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv`             | Famous dataset for flower classification.                               | Seaborn |

## Information about this dataset 

The Iris dataset is one of the most famous datasets in machine learning and statistics.
It was introduced by Ronald A. Fisher in 1936 as an example for linear discriminant analysis and has since become a benchmark for classification algorithms.

### Context

The dataset contains 150 samples of **iris flowers** from **three different species**:

- Setosa
- Versicolor
- Virginica

For each sample, four features were measured:

- Sepal length (cm)
- Sepal width (cm)
- Petal length (cm)
- Petal width (cm)

### The classic task with the Iris dataset

> Given the measurements of a flower, predict its species 

This is a multi-class classification problem, where the model must assign one of three possible labels.



In [11]:
import pandas as pd
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv"
df = pd.read_csv(url)

In [12]:
# Quick overview
print(df.shape)
print(df.head())

(150, 5)
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


We have 150 rows and 5 columns.

Notice the column names: `sepal_length`, `sepal_width`, `petal_length`, `petal_width`, `species`.

In [13]:
# Column names, non-null counts and dtypes
# (info() prints its report itself and returns None, so no print() is needed)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


### What does this output mean?

- `<class 'pandas.core.frame.DataFrame'>`
    The data is stored as a pandas DataFrame.

- `RangeIndex: 150 entries, 0 to 149`
    There are 150 rows, indexed from 0 to 149.

- `Data columns (total 5 columns):`
    The DataFrame has 5 columns.

- Column details:
    - `sepal_length`: 150 non-null values, type float64 (decimal numbers)
    - `sepal_width`: 150 non-null values, type float64
    - `petal_length`: 150 non-null values, type float64
    - `petal_width`: 150 non-null values, type float64
    - `species`: 150 non-null values, type object (usually strings, e.g., flower species)

- `dtypes: float64(4), object(1)`
    Four columns are numeric (`float64`), one is categorical (`object`).

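- `memory usage: 6.0+ KB`
    The DataFrame uses at least 6 KB of memory. The `+` means this is a lower bound: by default pandas does not count the memory held by the strings in the `object` column.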
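
To see what the `+` hides, pandas can be asked for a "deep" memory count. The following is a minimal sketch on a small made-up frame (not the Iris data itself); the name `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Tiny made-up frame standing in for the Iris data (values are illustrative only)
toy = pd.DataFrame({
    "petal_width": [0.2, 1.3, 2.5],
    # Force the object dtype so the column behaves like 'species' above
    "species": pd.Series(["setosa", "versicolor", "virginica"], dtype=object),
})

# Shallow count: an object column is measured by its pointers only
shallow = toy.memory_usage(deep=False).sum()

# Deep count: also measures the Python string objects the pointers refer to
deep = toy.memory_usage(deep=True).sum()

print(shallow, deep)
```

On the real DataFrame, `df.info(memory_usage="deep")` should print the full figure without the `+`.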

### What can we conclude from this information?

The dataset has 150 rows and 5 columns, with no missing values. Four columns are numeric measurements, and one column (`species`) is categorical.

In [17]:
# Basic statistics
df.describe()

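|       | sepal_length | sepal_width | petal_length | petal_width |
| ----- | ------------ | ----------- | ------------ | ----------- |
| count | 150.0        | 150.0       | 150.0        | 150.0       |
| mean  | 5.843333     | 3.057333    | 3.758        | 1.199333    |
| std   | 0.828066     | 0.435866    | 1.765298     | 0.762238    |
| min   | 4.3          | 2.0         | 1.0          | 0.1         |
| 25%   | 5.1          | 2.8         | 1.6          | 0.3         |
| 50%   | 5.8          | 3.0         | 4.35         | 1.3         |
| 75%   | 6.4          | 3.3         | 5.1          | 1.8         |
| max   | 7.9          | 4.4         | 6.9          | 2.5         |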


In [18]:
# Checking missing values
df.isnull().sum()

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
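
Iris has no missing values, so the fill-or-drop decision from task 2 needs no action here. For a dataset that does have gaps, the two options could look like the sketch below. The frame `toy` is made up for illustration, and using the median and the mode as fill values is one common choice, not the only one.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps (not the Iris data), purely for illustration
toy = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 4.6],
    "species": ["setosa", "setosa", None, "virginica"],
})

# Count missing values per column, as done above for Iris
missing = toy.isnull().sum()

# Option 1: drop every row that has at least one missing value
dropped = toy.dropna()

# Option 2: fill numeric gaps with the column median and
# categorical gaps with the most frequent value (the mode)
filled = toy.copy()
filled["sepal_length"] = filled["sepal_length"].fillna(filled["sepal_length"].median())
filled["species"] = filled["species"].fillna(filled["species"].mode()[0])

print(missing)
print(dropped)
print(filled)
```

Dropping is simplest when only a few rows are affected. Filling keeps every row but puts estimated values into the data.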

In [19]:
# Checking unique values in the 'species' column
df['species'].unique()


array(['setosa', 'versicolor', 'virginica'], dtype=object)

In [23]:
# One-hot encode the 'species' column with pandas get_dummies
# (dtype=int gives 0/1 integers instead of True/False)
df_onehot = pd.get_dummies(df, columns=['species'], dtype=int)
df_onehot.head()


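|   | sepal_length | sepal_width | petal_length | petal_width | species_setosa | species_versicolor | species_virginica |
| - | ------------ | ----------- | ------------ | ----------- | -------------- | ------------------ | ----------------- |
| 0 | 5.1          | 3.5         | 1.4          | 0.2         | 1              | 0                  | 0                 |
| 1 | 4.9          | 3.0         | 1.4          | 0.2         | 1              | 0                  | 0                 |
| 2 | 4.7          | 3.2         | 1.3          | 0.2         | 1              | 0                  | 0                 |
| 3 | 4.6          | 3.1         | 1.5          | 0.2         | 1              | 0                  | 0                 |
| 4 | 5.0          | 3.6         | 1.4          | 0.2         | 1              | 0                  | 0                 |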


This replaces the `species` column with three indicator columns:
```
species_setosa | species_versicolor | species_virginica
```

Each row has a 1 in the column corresponding to its species and a 0 in the other two.
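
One-hot encoding is one way to make `species` numeric. If a single integer column is preferred (closer to the gender → 0/1 example in the task list), label encoding with an explicit mapping is an alternative. This is a sketch only: the frame `toy`, the dictionary `species_to_code`, and the column name `species_code` are assumptions, not part of the notebook above.

```python
import pandas as pd

# Small stand-in for the Iris frame (illustrative rows only)
toy = pd.DataFrame({"species": ["setosa", "versicolor", "virginica", "setosa"]})

# Explicit mapping: which integer goes with which species is an arbitrary choice
species_to_code = {"setosa": 0, "versicolor": 1, "virginica": 2}
toy["species_code"] = toy["species"].map(species_to_code)

print(toy)
```

Integer codes suggest an ordering (0 < 1 < 2) that does not exist between species, so one-hot columns are usually the safer choice for model inputs. A single integer column is typically fine when `species` is the label being predicted.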

In [None]:
# Save cleaned one-hot encoded dataset in the current directory
df_onehot.to_csv("iris_clean.csv", index=False)
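
A quick way to check the saved file is to read it back and compare it with the frame that was written. The sketch below uses a two-row stand-in for `df_onehot` and a temporary directory so that it leaves no files behind; the names `toy`, `path`, and `reloaded` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for df_onehot with two illustrative rows
toy = pd.DataFrame({
    "sepal_length": [5.1, 7.0],
    "species_setosa": [1, 0],
    "species_versicolor": [0, 1],
})

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "iris_clean.csv")
    toy.to_csv(path, index=False)  # index=False keeps the row index out of the file
    reloaded = pd.read_csv(path)

# The reloaded frame should match the one that was saved
same = toy.equals(reloaded)
print(same)
```

Without `index=False`, the row index is written as an extra unnamed column, which shows up as `Unnamed: 0` when the file is read back.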