## Part 1

Numerical variables are often thought of as continuous, meaning that they can take on any real value. However, some variables that are recorded as numbers behave more like categories. Two common cases are discrete numerical variables and ordinal variables.

1. Discrete Numerical Variables:
- Discrete numerical variables are those that can only take on specific, distinct values within a certain range. These values are often integers and represent counts or categories. While these variables are numerical, they can be treated as categorical in certain situations.

- Consider a dataset of families, where each row represents a family and the numerical variable "Number of Children" represents the count of children in each family. While the number of children is a numerical value, it's also a category: families can have 0, 1, 2, 3, or more children. In this context, "Number of Children" can be treated as a categorical variable.

As another example, suppose we have customer ratings for a product, ranging from 1 to 5 and representing "Very Dissatisfied," "Dissatisfied," "Neutral," "Satisfied," and "Very Satisfied." Although the ratings are stored as numbers, each number stands for a category, so the variable can be treated as categorical (more precisely, ordinal) in sentiment analysis.
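As a minimal sketch of the "Number of Children" example above, the snippet below bins a count into the categories 0, 1, 2, and "3+" with `pd.cut`. The data and the column names `num_children` and `children_cat` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: number of children in eight families
families = pd.DataFrame({"num_children": [0, 2, 1, 3, 5, 0, 2, 4]})

# Bin the counts into the categories 0, 1, 2 and "3+".
# Bins are right-closed: (-1, 0], (0, 1], (1, 2], (2, inf]
families["children_cat"] = pd.cut(
    families["num_children"],
    bins=[-1, 0, 1, 2, float("inf")],
    labels=["0", "1", "2", "3+"],
)

# Frequency of each category, listed in category order
counts = families["children_cat"].value_counts().sort_index()
print(counts)
```

Once the column is categorical, summaries such as frequency tables are usually more informative than a mean number of children.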

2. Ordinal Variables:
Ordinal variables have a natural order, but the gaps between consecutive levels are not necessarily equal. For example, "education level" is an ordinal variable. It can be coded on a scale of 1 to 5, where 1 is "less than high school diploma" and 5 is "doctoral degree." We know that a person with a doctoral degree has a higher education level than a person without a high school diploma, but we cannot say by how much.

Ordinal variables are important to keep in mind when doing statistical analysis. Because they have a natural order, we can sometimes treat them as continuous variables, but we need to be careful not to make assumptions about the size of the differences between the categories.
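One way to keep the order without assuming equal gaps is an ordered categorical in pandas. The sketch below uses the education scale described above; only the labels for levels 1 and 5 come from the text, while the labels for levels 2 to 4 and the sample responses are assumptions made for illustration.

```python
import pandas as pd

# Levels 1 and 5 come from the text; the labels for 2-4 are assumed
levels = [
    "less than high school diploma",  # 1
    "high school diploma",            # 2 (assumed)
    "bachelor's degree",              # 3 (assumed)
    "master's degree",                # 4 (assumed)
    "doctoral degree",                # 5
]

# Hypothetical survey responses coded 1-5
codes = pd.Series([5, 2, 3, 1, 4, 2])

# Map codes to labels and declare the order explicitly
education = pd.Series(
    pd.Categorical(
        codes.map(lambda c: levels[c - 1]),
        categories=levels,
        ordered=True,
    )
)

# Order comparisons are meaningful for an ordered categorical
higher_than_hs = education > "high school diploma"
print(higher_than_hs.tolist())

# So are min, max, and the integer position of each category
print(education.min(), "|", education.max())
print(education.cat.codes.tolist())
```

Comparisons and rank-based summaries work, but pandas does not define a mean for categorical data, which matches the caution above about not assuming equal distances between categories.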

**Here are some examples of variables that are often coded as numbers but can be considered categorical:**
- Education level
- Rankings (e.g., a restaurant rating)
- Grades (e.g., A, B, C, D, or F)
- Severity of a disease (e.g., mild, moderate, severe)

In conclusion, some numerical variables can be treated as categorical: discrete numerical variables and ordinal variables. Ordinal variables have a natural order, but the gaps between consecutive levels are not necessarily equal.

## Part 2

In [39]:
import pandas as pd
import numpy as np

In [40]:
# Load the kc_house_data dataset
df = pd.read_csv('kc_house_data.csv')

In [41]:
# Take a subset of the dataset (copy it so later assignments do not act on a view of df)
columns = ['bedrooms', 'bathrooms', 'sqft_living', 'price']
subset = df[columns].head(200).copy()
subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     200 non-null    int64  
 1   bathrooms    200 non-null    float64
 2   sqft_living  200 non-null    int64  
 3   price        200 non-null    float64
dtypes: float64(2), int64(2)
memory usage: 6.4 KB


**Randomly introduce missing values into the selected columns: 10% of the rows for bedrooms, 20% for bathrooms, and 30% for sqft_living.**

In [42]:
# Fractions of missing values to introduce
missing_percentages = [0.1, 0.2, 0.3]

In [43]:
# Missing values for 'bedrooms' (fraction = 0.1)
num_missing_bedrooms = int(subset['bedrooms'].size * missing_percentages[0])
subset.loc[np.random.choice(subset.index, num_missing_bedrooms, replace=False), 'bedrooms'] = np.nan
subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     180 non-null    float64
 1   bathrooms    200 non-null    float64
 2   sqft_living  200 non-null    int64  
 3   price        200 non-null    float64
dtypes: float64(3), int64(1)
memory usage: 6.4 KB


**Only the bedrooms column has missing values (180 non-null entries out of 200), because it is the only column we introduced missing values into.**
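The `info()` output above also shows `bedrooms` switching from `int64` to `float64`. This happens because `NaN` is a floating-point value, so an integer column has to be upcast to hold it. The sketch below reproduces the effect on a made-up Series and shows pandas' nullable `Int64` dtype as one possible way to keep integers; the variable names are hypothetical.

```python
import pandas as pd

# Made-up integer column standing in for 'bedrooms'
s = pd.Series([3, 3, 2, 4, 3])
to_blank = s.index.isin([1, 3])   # rows that will become missing

# Masking an int64 Series inserts NaN and upcasts the result to float64
s_float = s.mask(to_blank)
print(s.dtype, "->", s_float.dtype)

# The nullable Int64 dtype stores missing values as <NA> and stays integer
s_nullable = s.astype("Int64")
s_nullable[to_blank] = pd.NA
print(s_nullable.dtype, s_nullable.isna().sum())
```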

In [44]:
# Missing values for 'bathrooms' (fraction = 0.2)
num_missing_bathrooms = int(subset['bathrooms'].size * missing_percentages[1])
subset.loc[np.random.choice(subset.index, num_missing_bathrooms, replace=False), 'bathrooms'] = np.nan
subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     180 non-null    float64
 1   bathrooms    160 non-null    float64
 2   sqft_living  200 non-null    int64  
 3   price        200 non-null    float64
dtypes: float64(3), int64(1)
memory usage: 6.4 KB


**Both bedrooms and bathrooms now have missing values, but in different amounts (20 and 40), because we used different percentages (10% and 20%).**

In [45]:
# Missing values for 'sqft_living' (fraction = 0.3)
num_missing_sqft_living = int(subset['sqft_living'].size * missing_percentages[2])
subset.loc[np.random.choice(subset.index, num_missing_sqft_living, replace=False), 'sqft_living'] = np.nan
subset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     180 non-null    float64
 1   bathrooms    160 non-null    float64
 2   sqft_living  140 non-null    float64
 3   price        200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


**The bedrooms, bathrooms, and sqft_living columns now have missing values in different amounts, because we used a different percentage for each. The price column has no missing values because we did not introduce any.**
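The three cells above repeat the same pattern, so they could be folded into one helper. The sketch below is one possible version, not the notebook's own code: the function name `inject_missing`, the seed, and the synthetic `demo` frame are all assumptions. Seeding the generator makes the missing positions reproducible between runs.

```python
import numpy as np
import pandas as pd

def inject_missing(frame, fractions, seed=0):
    """Return a copy of `frame` with a fraction of each listed column set to NaN."""
    rng = np.random.default_rng(seed)   # seeded for reproducibility
    out = frame.copy()
    for column, fraction in fractions.items():
        n_missing = int(len(out) * fraction)
        rows = rng.choice(out.index, size=n_missing, replace=False)
        # Cast to float first so NaN can be stored without an implicit upcast
        out[column] = out[column].astype("float64")
        out.loc[rows, column] = np.nan
    return out

# Synthetic stand-in for the 200-row subset (values are made up)
demo = pd.DataFrame({
    "bedrooms": np.arange(200) % 5 + 1,
    "bathrooms": (np.arange(200) % 4 + 1) * 0.75,
    "sqft_living": np.arange(200) * 10 + 500,
    "price": np.arange(200) * 1000.0 + 100000,
})

with_nan = inject_missing(
    demo, {"bedrooms": 0.1, "bathrooms": 0.2, "sqft_living": 0.3}
)
print(with_nan.isna().sum())
```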

In [46]:
# Drop rows with any missing values
df1_dropped_rows = subset.dropna()
df1_dropped_rows.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 103 entries, 0 to 194
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     103 non-null    float64
 1   bathrooms    103 non-null    float64
 2   sqft_living  103 non-null    float64
 3   price        103 non-null    float64
dtypes: float64(4)
memory usage: 4.0 KB


**Dropping every row that contains a missing value is costly here: it reduces the number of entries from 200 to 103. Losing almost half of the data can harm the analysis.**
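`dropna` can be made less aggressive with its `subset` and `thresh` parameters. The sketch below compares the three behaviours on a small made-up frame; the values are illustrative, not taken from `kc_house_data.csv`.

```python
import numpy as np
import pandas as pd

# Made-up frame with scattered missing values
demo = pd.DataFrame({
    "bedrooms":    [3, np.nan, 2, 4, np.nan],
    "bathrooms":   [1.0, 2.25, np.nan, np.nan, np.nan],
    "sqft_living": [1180, 2570, 770, np.nan, np.nan],
    "price":       [221900, 538000, 180000, 604000, 510000],
})

# Default: drop a row if ANY value is missing
strict = demo.dropna()

# Drop a row only if a chosen column is missing
need_bedrooms = demo.dropna(subset=["bedrooms"])

# Keep rows that have at least 3 non-missing values
mostly_complete = demo.dropna(thresh=3)

print(len(demo), len(strict), len(need_bedrooms), len(mostly_complete))
# prints: 5 1 3 3
```

Which variant is appropriate depends on which columns the later analysis actually needs.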

In [49]:
# Replace missing values with the median of each column
df2_with_median = subset.fillna(subset.median())
df2_with_median.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     200 non-null    float64
 1   bathrooms    200 non-null    float64
 2   sqft_living  200 non-null    float64
 3   price        200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB
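The median is often preferred over the mean for imputation because it is less sensitive to extreme values. The sketch below illustrates the difference on a made-up, skewed column; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme entry and one missing entry
values = pd.Series([200_000, 220_000, 250_000, np.nan, 5_000_000], dtype="float64")

# Both statistics skip NaN by default
print(values.mean())    # 1417500.0, pulled up by the extreme entry
print(values.median())  # 235000.0, the middle of the observed values

filled_median = values.fillna(values.median())
filled_mean = values.fillna(values.mean())
print(filled_median[3], filled_mean[3])
```

With a DataFrame, `fillna(frame.median())` applies the same idea column by column, because the Series of medians is matched to columns by name.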


In [51]:
# Drop rows where 'bedrooms' is greater than 3
condition = subset['bedrooms'] > 3
df3_condition = subset[~condition]
df3_condition.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 131 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   bedrooms     111 non-null    float64
 1   bathrooms    105 non-null    float64
 2   sqft_living  97 non-null     float64
 3   price        131 non-null    float64
dtypes: float64(4)
memory usage: 5.1 KB


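**After dropping the 69 rows with more than 3 bedrooms, 131 entries remain. Note that 20 of them have a missing bedrooms value (only 111 are non-null): comparing NaN with `> 3` gives False, so those rows are not dropped.**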
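The filter `subset[~condition]` keeps rows whose `bedrooms` value is missing, which is why the output above lists 131 entries but only 111 non-null `bedrooms`. The sketch below shows the difference between negating `> 3` and testing `<= 3` directly, using a made-up Series.

```python
import numpy as np
import pandas as pd

# Made-up bedroom counts with two missing entries
bedrooms = pd.Series([3, 4, np.nan, 2, 5, np.nan])

more_than_3 = bedrooms > 3               # NaN > 3 evaluates to False
kept_negated = bedrooms[~more_than_3]    # keeps 3, NaN, 2, NaN
kept_direct = bedrooms[bedrooms <= 3]    # keeps 3 and 2 only

print(len(kept_negated), len(kept_direct))
# prints: 4 2
```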

In [53]:
# Drop columns with any missing values
df4_dropped_columns = subset.dropna(axis=1)
df4_dropped_columns.head()

Unnamed: 0,price
0,221900.0
1,538000.0
2,180000.0
3,604000.0
4,510000.0
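Dropping every column that has any missing value leaves only `price` here. A softer variant is to drop a column only when too few of its values are present, using `thresh` together with `axis=1`. The sketch below uses a made-up frame and an arbitrary 80% cutoff chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'bedrooms' has 1 missing value, 'sqft_living' has 4
demo = pd.DataFrame({
    "bedrooms":    [3, 3, 2, 4, 3, 4, 3, 3, 3, np.nan],
    "sqft_living": [1180, np.nan, 770, np.nan, 1680,
                    np.nan, 1715, np.nan, 1780, 1890],
    "price":       [221900, 538000, 180000, 604000, 510000,
                    1230000, 257500, 291850, 229500, 323000],
})

# Drop a column if it has ANY missing value
only_complete = demo.dropna(axis=1)

# Keep a column if at least 80% of its values are present (assumed cutoff)
min_non_null = int(0.8 * len(demo))
mostly_present = demo.dropna(axis=1, thresh=min_non_null)

print(list(only_complete.columns))
print(list(mostly_present.columns))
```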


## Fill missing values using forward fill and backward fill

**A forward fill (often abbreviated as "ffill") involves propagating the last known non-missing value forward along the column until a new non-missing value is encountered. In other words, if there's a missing value in a cell, the value from the previous row (the one above it) is copied into that cell. This process is performed column-wise.**

![0_YwUB4dPvVs2GQHLV.png](attachment:0_YwUB4dPvVs2GQHLV.png)

In [54]:
# Fill missing values using forward fill
# (DataFrame.ffill() is equivalent to fillna(method='ffill'), which newer pandas versions deprecate)
df5_ffill = subset.ffill()

In [56]:
subset.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,price
0,3.0,1.0,1180.0,221900.0
1,3.0,2.25,2570.0,538000.0
2,2.0,,770.0,180000.0
3,4.0,,1960.0,604000.0
4,3.0,,,510000.0


In [57]:
df5_ffill.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,price
0,3.0,1.0,1180.0,221900.0
1,3.0,2.25,2570.0,538000.0
2,2.0,2.25,770.0,180000.0
3,4.0,2.25,1960.0,604000.0
4,3.0,2.25,1960.0,510000.0
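Two details of forward fill are easy to miss: a missing value at the very top of a column has nothing above it to copy, and a long gap is filled entirely by one value unless a limit is set. The sketch below shows both on a made-up Series.

```python
import numpy as np
import pandas as pd

# Made-up column: one leading gap and one gap of three rows
s = pd.Series([np.nan, 1.0, np.nan, np.nan, np.nan, 2.5])

# Plain forward fill: the leading NaN stays, the long gap becomes 1.0
filled = s.ffill()
print(filled.tolist())

# limit=1 fills at most one consecutive missing value per gap
capped = s.ffill(limit=1)
print(capped.tolist())
```

Setting a limit can be a reasonable safeguard when a value copied from many rows earlier is unlikely to be a good stand-in.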


**A backward fill (often abbreviated as "bfill") involves propagating the next known non-missing value backward along the column until a new non-missing value is encountered. In other words, if there's a missing value in a cell, the value from the following row (the one below it) is copied into that cell. This process is performed column-wise.**

![bfill.png](attachment:bfill.png)

In [55]:
# Fill missing values using backward fill
# (DataFrame.bfill() is equivalent to fillna(method='bfill'), which newer pandas versions deprecate)
df5_bfill = subset.bfill()

In [60]:
subset.head(10)

Unnamed: 0,bedrooms,bathrooms,sqft_living,price
0,3.0,1.0,1180.0,221900.0
1,3.0,2.25,2570.0,538000.0
2,2.0,,770.0,180000.0
3,4.0,,1960.0,604000.0
4,3.0,,,510000.0
5,4.0,,,1230000.0
6,3.0,2.25,,257500.0
7,3.0,1.5,,291850.0
8,3.0,,1780.0,229500.0
9,3.0,2.5,,323000.0


In [61]:
df5_bfill.head(10)

Unnamed: 0,bedrooms,bathrooms,sqft_living,price
0,3.0,1.0,1180.0,221900.0
1,3.0,2.25,2570.0,538000.0
2,2.0,2.25,770.0,180000.0
3,4.0,2.25,1960.0,604000.0
4,3.0,2.25,1780.0,510000.0
5,4.0,2.25,1780.0,1230000.0
6,3.0,2.25,1780.0,257500.0
7,3.0,1.5,1780.0,291850.0
8,3.0,2.5,1780.0,229500.0
9,3.0,2.5,3560.0,323000.0
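Backward fill has the mirror-image limitation of forward fill: a missing value at the bottom of a column has nothing below it to copy. The sketch below shows this on a made-up Series, and shows that chaining the two fills covers both ends.

```python
import numpy as np
import pandas as pd

# Made-up column with a gap at the top, in the middle, and at the bottom
s = pd.Series([np.nan, 1.0, np.nan, 2.5, np.nan])

# Backward fill alone leaves the trailing NaN in place
back = s.bfill()
print(back.tolist())

# Forward fill first, then backward fill for whatever is left at the top
both = s.ffill().bfill()
print(both.tolist())
# prints: [1.0, 1.0, 1.0, 2.5, 2.5]
```

Both fills copy values from neighbouring rows, so they are most natural when the row order carries meaning, for example in time-ordered data.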
