# NUMERICAL VALUES CASE : CATEGORICAL

### Numerical variables can be considered categorical if they have a finite number of possible values and the values do not have a natural ordering

### KEY POINTS: Recognizing Numerical Categorical Data

#### Recognizing numerical categorical data involves identifying columns that contain numerical values but are actually used to represent categories or labels rather than continuous quantities. Here are three key points to help you recognize numerical categorical data:

#### Limited Number of Unique Values: Numerical categorical columns typically have a small, fixed number of unique values. These values represent different categories or labels rather than a wide range of continuous numerical values.

#### No Meaningful Arithmetic Operations: In numerical categorical columns, arithmetic operations (e.g., addition, subtraction) often don't make sense. The values are used as labels, and performing arithmetic on them doesn't provide meaningful results.

#### Semantic Meaning: Numerical categorical values carry a semantic meaning that represents categories or labels. For example, in my code below, the "zip_code" column uses numerical values (10001, 20001, 30001, 40001), but these values identify postal areas, not actual numerical quantities. Columns such as "blood_type" (A, B, AB, O) are categorical too, but they are stored as text rather than numbers.
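The first key point (a limited number of unique values) can be checked mechanically. The snippet below is a minimal sketch, not a definitive rule: the `sample` data, the helper name `candidate_categoricals`, and the `max_ratio` threshold of 0.5 are all assumptions made for illustration. It only flags candidates; whether arithmetic is meaningful and what the values mean still has to be judged by reading the column.

```python
import pandas as pd

# Hypothetical sample data: "zip_code" is a label, "price" is a quantity.
sample = pd.DataFrame({
    "zip_code": [10001, 20001, 10001, 30001, 20001, 10001],
    "price": [221900.0, 538000.0, 180000.0, 604000.0, 510000.0, 360000.0],
})

def candidate_categoricals(frame, max_ratio=0.5):
    """Return numeric columns whose share of unique values is at most max_ratio."""
    numeric = frame.select_dtypes(include="number")
    # Ratio of distinct values to rows: low ratios suggest repeated labels.
    ratios = numeric.nunique() / len(numeric)
    return ratios[ratios <= max_ratio].index.tolist()

candidates = candidate_categoricals(sample)
print(candidates)
```

On very small frames, where every value is unique, this heuristic flags nothing, so it is better suited to full datasets than to toy examples.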

### EXAMPLE 

In [None]:
import pandas as pd

# Create a dataframe with some numerical categorical variables
df = pd.DataFrame({
    "gender": ["male", "female", "male", "female"],
    "blood_type": ["A", "B", "AB", "O"],
    "zip_code": [10001, 20001, 30001, 40001],
    "eye_color": ["brown", "blue", "green", "hazel"],
    "marital_status": ["single", "married", "divorced", "widowed"],
    "house_number": [10001, 20001, 30001, 40001]
})

# Print the dataframe
print(df)




   gender blood_type  zip_code eye_color marital_status  house_number
0    male          A     10001     brown         single         10001
1  female          B     20001      blue        married         20001
2    male         AB     30001     green       divorced         30001
3  female          O     40001     hazel        widowed         40001


In [None]:
# Print the column names
print(df.columns)
print()


Index(['gender', 'blood_type', 'zip_code', 'eye_color', 'marital_status',
       'house_number'],
      dtype='object')



In [None]:
# Print the data types of each column
print(df.dtypes)
print()


gender            object
blood_type        object
zip_code           int64
eye_color         object
marital_status    object
house_number       int64
dtype: object



In [None]:
# Print the number of rows and columns in the dataframe
print(df.shape)
print()


(4, 6)



In [None]:
# Print the unique values in each column
for col in df.columns:
    print(df[col].unique())
print()


['male' 'female']
['A' 'B' 'AB' 'O']
[10001 20001 30001 40001]
['brown' 'blue' 'green' 'hazel']
['single' 'married' 'divorced' 'widowed']
[10001 20001 30001 40001]



In [None]:
# Print the frequency of each value in each column
for col in df.columns:
    print(df[col].value_counts())

male      2
female    2
Name: gender, dtype: int64
A     1
B     1
AB    1
O     1
Name: blood_type, dtype: int64
10001    1
20001    1
30001    1
40001    1
Name: zip_code, dtype: int64
brown    1
blue     1
green    1
hazel    1
Name: eye_color, dtype: int64
single      1
married     1
divorced    1
widowed     1
Name: marital_status, dtype: int64
10001    1
20001    1
30001    1
40001    1
Name: house_number, dtype: int64


#### Explanation for The DF Created

#### The columns in the dataframe are considered categorical because the values in each column are discrete and do not have a natural ordering.

#### For example, the values in the gender column are "male" and "female". These are discrete values, and they do not have a natural ordering. It does not make sense to say that "male" is greater than "female" or that "female" is less than "male".

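#### The same is true for the other columns in the dataframe. The values in the blood_type, eye_color, and marital_status columns are text labels. The values in the zip_code and house_number columns are stored as integers (int64), but they identify a postal area or a building rather than measure a quantity, so adding or averaging them is meaningless and their numeric order carries no meaning for the analysis.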

#### Therefore, the columns in the dataframe are considered categorical.
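Because pandas infers `int64` for `zip_code` and `house_number`, it will treat them as quantities unless told otherwise. One possible way to record the intent, sketched here on a small rebuilt copy of the two columns, is to convert them to the `category` dtype with `astype`. This is an illustration of one option, not the only way to handle such columns.

```python
import pandas as pd

# Rebuild only the two integer-coded columns from the example above.
df = pd.DataFrame({
    "zip_code": [10001, 20001, 30001, 40001],
    "house_number": [10001, 20001, 30001, 40001],
})

# Mark the integer codes as unordered categories instead of quantities.
for col in ["zip_code", "house_number"]:
    df[col] = df[col].astype("category")

print(df.dtypes)
```

After the conversion, `df["zip_code"].describe()` reports the count, number of unique values, and most frequent value instead of a mean and standard deviation.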



## Part 2

### Reading the data

In [4]:
import pandas as pd
df = pd.read_csv(r"C:\Users\Lenovo\Desktop\workspace\week 1\Python\Pandas_1\kc_house_data.csv")


Unnamed: 0,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.00,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.7210,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.00,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.00,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.00,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21608,263000018,20140521T000000,360000.0,3,2.50,1530,1131,3.0,0,0,3,8,1530,0,2009,0,98103,47.6993,-122.346,1530,1509
21609,6600060120,20150223T000000,400000.0,4,2.50,2310,5813,2.0,0,0,3,8,2310,0,2014,0,98146,47.5107,-122.362,1830,7200
21610,1523300141,20140623T000000,402101.0,2,0.75,1020,1350,2.0,0,0,3,7,1020,0,2009,0,98144,47.5944,-122.299,1020,2007
21611,291310100,20150116T000000,400000.0,3,2.50,1600,2388,2.0,0,0,3,8,1600,0,2004,0,98027,47.5345,-122.069,1410,1287


### Taking a subset of the data

In [14]:
list_cols = ['price', 'bedrooms', 'bathrooms', 'sqft_living'] ## we will consider those columns only
df2 = df[list_cols].copy()
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   price        21613 non-null  float64
 1   bedrooms     21613 non-null  int64  
 2   bathrooms    21613 non-null  float64
 3   sqft_living  21613 non-null  int64  
dtypes: float64(2), int64(2)
memory usage: 675.5 KB


In [62]:
import numpy as np
m_p = 0.1  # missing percentage: fraction of values to replace with NaN

In [72]:
# Create a boolean mask to introduce missing values into the DataFrame.
# Generate random values between 0 and 1 for each element in the DataFrame shape.
# Each element of the mask is set to True with a probability determined by m_p.
mask = np.random.rand(*df2.shape) < m_p
# Use the generated mask to replace corresponding elements in the DataFrame with NaN (missing values).
df2[mask] = np.nan
# Display the shape of the DataFrame to show the number of rows and columns after introducing missing values.
df2.shape

(21613, 4)
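The mask above is drawn from an unseeded generator, so the positions and counts of the missing values change on every run. The sketch below shows one way to make the step reproducible and to check the resulting share of missing values. It assumes a small stand-in frame `demo` instead of the real `df2`, an arbitrary seed of 0, and uses `DataFrame.mask` to return a new frame rather than modifying the data in place.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df2 (the real data is not needed for the idea).
demo = pd.DataFrame({
    "price": np.arange(1000, dtype=float),
    "bedrooms": np.arange(1000) % 5,
})

m_p = 0.1                            # fraction of values to blank out
rng = np.random.default_rng(seed=0)  # seeded generator: same mask every run
mask = rng.random(demo.shape) < m_p  # True where a value will become NaN

# DataFrame.mask replaces entries where the condition is True with NaN
# and leaves the original frame untouched.
demo_missing = demo.mask(mask)

# Share of missing values per column; each should be close to m_p.
print(demo_missing.isna().mean())
```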

In [64]:
df2.tail()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living
21608,360000.0,3.0,2.0,1530.0
21609,400000.0,4.0,2.0,2310.0
21610,,2.0,2.0,1020.0
21611,400000.0,,,1600.0
21612,325000.0,2.0,1.0,1020.0


In [65]:
# Remove rows with missing values (NaN) from the DataFrame df2.
##df2_cleaned = df2.dropna()

# Display the number of rows remaining in the cleaned DataFrame.
##print(len(df2_cleaned))


### Filling the missing values 

In [66]:
df.describe()


Unnamed: 0,id,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
count,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0,21613.0
mean,4580302000.0,540182.2,3.370842,2.058715,2079.899736,15106.97,1.494309,0.007542,0.234303,3.40943,7.656873,1788.390691,291.509045,1971.005136,84.402258,98077.939805,47.560053,-122.213896,1986.552492,12768.455652
std,2876566000.0,367362.2,0.930062,0.755524,918.440897,41420.51,0.539989,0.086517,0.766318,0.650743,1.175459,828.090978,442.575043,29.373411,401.67924,53.505026,0.138564,0.140828,685.391304,27304.179631
min,1000102.0,75000.0,0.0,0.0,290.0,520.0,1.0,0.0,0.0,1.0,1.0,290.0,0.0,1900.0,0.0,98001.0,47.1559,-122.519,399.0,651.0
25%,2123049000.0,321950.0,3.0,2.0,1427.0,5040.0,1.0,0.0,0.0,3.0,7.0,1190.0,0.0,1951.0,0.0,98033.0,47.471,-122.328,1490.0,5100.0
50%,3904930000.0,450000.0,3.0,2.0,1910.0,7618.0,1.5,0.0,0.0,3.0,7.0,1560.0,0.0,1975.0,0.0,98065.0,47.5718,-122.23,1840.0,7620.0
75%,7308900000.0,645000.0,4.0,2.0,2550.0,10688.0,2.0,0.0,0.0,4.0,8.0,2210.0,560.0,1997.0,0.0,98118.0,47.678,-122.125,2360.0,10083.0
max,9900000000.0,7700000.0,33.0,8.0,13540.0,1651359.0,3.5,1.0,4.0,5.0,13.0,9410.0,4820.0,2015.0,2015.0,98199.0,47.7776,-121.315,6210.0,871200.0


In [73]:
for column in list_cols:
    if df2[column].isnull().any():
        # Count the number of missing values
        missing_values = df2[column].isnull().sum()

        # Print the number of missing values
        print(f"There are {missing_values} missing values in the {column} column.")

        # Fill missing values with the mean
        # Assign the result back instead of using inplace=True on a column selection
        df2[column] = df2[column].fillna(df2[column].mean())


There are 2237 missing values in the price column.
There are 2072 missing values in the bedrooms column.
There are 2202 missing values in the bathrooms column.
There are 2107 missing values in the sqft_living column.
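Filling with the mean suits a continuous column such as `price`, but for a count such as `bedrooms` it can produce fractional values that no house actually has. As a possible alternative, the sketch below fills the continuous column with its mean and the discrete column with its median by passing a per-column dictionary to `fillna`. The four-row `demo` frame is made up for illustration, and the choice of median is an assumption rather than the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of df2 with one missing value per column.
demo = pd.DataFrame({
    "price": [221900.0, np.nan, 180000.0, 604000.0],
    "bedrooms": [3.0, 3.0, np.nan, 4.0],
})

# Choose a fill statistic per column: mean for continuous, median for counts.
fill_values = {
    "price": demo["price"].mean(),
    "bedrooms": demo["bedrooms"].median(),
}

# fillna accepts a dict mapping column names to replacement values.
filled = demo.fillna(value=fill_values)
print(filled)
```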
