### Housing Data
The following quiz tasks (N1, N2, N3) work with **Housing_Data.csv**, which comes from [Kaggle](https://www.kaggle.com/code/rahulvani09/house-price-prediction).
The data is in comma-separated values (CSV) format. Most of the fields are yes/no, so they will be encoded as 1/0. The fields are:

- **price**: price of the house in dollars
- **area**: code of the area (country unknown)
- **bedrooms**: number of bedrooms
- **bathrooms**: number of bathrooms
- **stories**: number of stories in the house
- **mainroad**: is next to main road (yes or no)
- **guestroom**: has a guest room (yes or no)
- **basement**: has a basement (yes or no)
- **hotwaterheating**: has hot water heating (yes or no)
- **airconditioning**: has air conditioning (yes or no)
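- **parking**: number of parking spaces (numeric; the sample rows show values such as 2 and 3, not yes/no)
- **prefarea**: not explained in the dataset description; a yes/no field (presumably whether the house is in a preferred area)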
- **furnishingstatus**: furnishing status in three categories (furnished, semi-furnished, unfurnished)

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

#### Everything other than the yes/no columns and furnishingstatus is numeric; to work with the data properly, convert those columns
#### into an appropriate numeric type

In [3]:
df = pd.read_csv("Housing_Data.csv")
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [4]:
# let's see if we have any null values
df.isnull().sum()
# since there are no nulls, we can work on the dataset freely

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

##### Most of the yes/no columns will be replaced by 1 or 0

In [5]:
columns_to_replace = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']

df[columns_to_replace] = df[columns_to_replace].map(lambda x: 1 if x == 'yes' else 0)
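
The element-wise `map` used here can also be written as a vectorized comparison. Below is a minimal sketch on a toy frame; the values in `toy` are made up for illustration, and only the column names are taken from the dataset.

```python
import pandas as pd

# Toy frame for illustration only: values are made up,
# column names match the housing dataset.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "guestroom": ["no", "no", "yes"],
    "parking": [2, 0, 1],
})

yes_no_columns = ["mainroad", "guestroom"]

# Comparing with 'yes' gives booleans; astype(int) turns True/False into 1/0.
toy[yes_no_columns] = (toy[yes_no_columns] == "yes").astype(int)

print(toy)
```

Both forms treat anything that is not exactly `'yes'` as 0, so stray spellings such as `'Yes'` would silently become 0 in either version.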

In [6]:
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,furnished
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,furnished
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,semi-furnished
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,furnished
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,furnished


##### We have one categorical column, furnishingstatus. Since it has three categories, add one column per category
##### and fill each with 0 or 1 accordingly

In [7]:
df = pd.get_dummies(df, columns=['furnishingstatus'])

In [8]:
columns_to_replace = ['furnishingstatus_semi-furnished', 'furnishingstatus_furnished', 'furnishingstatus_unfurnished']
# cast the indicator columns (True/False) to 0/1 integers
df[columns_to_replace] = df[columns_to_replace].astype(int)
df

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus_furnished,furnishingstatus_semi-furnished,furnishingstatus_unfurnished
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,1,0,0
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,1,0,0
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,0,1,0
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,1,0,0
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
540,1820000,3000,2,1,1,1,0,1,0,0,2,0,0,0,1
541,1767150,2400,3,1,1,0,0,0,0,0,0,0,0,1,0
542,1750000,3620,2,1,1,1,0,0,0,0,0,0,0,0,1
543,1750000,2910,3,1,1,0,0,0,0,0,0,0,1,0,0
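
The one-hot encoding and the cast to 0/1 can be done in a single call. The sketch below uses a made-up three-row frame (only the `furnishingstatus` name and its categories come from the dataset) and the `dtype` and `drop_first` parameters of `pd.get_dummies`.

```python
import pandas as pd

# Toy frame for illustration only; the prices are made up.
toy = pd.DataFrame({
    "price": [100, 200, 300],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# dtype=int makes the indicator columns 0/1 directly instead of True/False.
encoded = pd.get_dummies(toy, columns=["furnishingstatus"], dtype=int)

# drop_first=True keeps k-1 indicator columns; the dropped category
# ('furnished', the first one alphabetically) is implied when the others are 0.
encoded_reduced = pd.get_dummies(
    toy, columns=["furnishingstatus"], dtype=int, drop_first=True
)

print(encoded.columns.tolist())
print(encoded_reduced.columns.tolist())
```

Dropping one category is optional. Some linear-model workflows prefer it because the full set of indicator columns is redundant alongside an intercept, but the notebook keeps all three columns.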


#### Simple linear regression, based on number of bedrooms

##### The data is now clean. Let's start with simple linear regression: we choose the number of bedrooms as the only feature
##### and train the model to predict the price of the house

In [9]:
df2 = df.copy()

In [10]:
X = df2[['bedrooms']]
Y = df2['price']

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=42)

In [11]:
# create linear regression model
model = LinearRegression()
model.fit(X_train, Y_train)

# evaluate model
train_score = model.score(X_train, Y_train)
test_score = model.score(X_test, Y_test)

print("Training Score with single variable:", train_score)
print("Testing Score with single variable:", test_score)

Training Score with single variable: 0.13788014956136474
Testing Score with single variable: 0.10054871227244688
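
The value returned by `LinearRegression.score` is the coefficient of determination, R², so a score of about 0.14 means the single feature explains roughly 14% of the variance in price. As a minimal sketch on synthetic data (not the housing data), R² can be computed by hand and compared with `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only: a noisy linear relationship.
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 6, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(0, 2.0, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
y_hat = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, demo_model.score(X_demo, y_demo))
```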


#### To make use of all the data and build a more advanced model, let's use multiple linear regression

In [12]:
# features and target variable
# we will be predicting the price of the houses
X = df.drop("price", axis=1)
y = df["price"]

# use 90% of the data for training, 10% for testing; we can divide them using the train_test_split function
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
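
With `test_size=0.1`, `train_test_split` holds out 10% of the rows for testing and keeps the remaining 90% for training. A small sketch with 100 placeholder rows (made up for illustration) makes the proportions easy to check:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder rows, for illustration only.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

X_demo_train, X_demo_test, y_demo_train, y_demo_test = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)

# Number of rows that ended up in each part of the split.
print(len(X_demo_train), len(X_demo_test))
```

Fixing `random_state` makes the split reproducible, so repeated runs score the model on the same held-out rows.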

In [13]:

# create linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# evaluate model
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Training Score:", train_score)
print("Testing Score:", test_score)

Training Score: 0.6784744206118938
Testing Score: 0.6875282421536384


# Results

The linear regression results for a single variable are quite low (R² of about 0.14 on the training set and 0.10 on the test set). There can be multiple reasons for that; first of all, a single feature is not nearly enough to explain the price. After switching to multiple linear regression, the result gets significantly better: R² is about 0.68 on both the training and test sets.

# Decision Tree Regression

In [14]:
from sklearn.tree import DecisionTreeRegressor

In [15]:
X = df.drop('price', axis=1)
y = df['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [16]:
# Create and fit the model
model = DecisionTreeRegressor()
model.fit(X_train, y_train)

# Evaluate the model
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)

print("Training Score:", train_score)
print("Testing Score:", test_score)

Training Score: 0.9985402884288594
Testing Score: 0.3912065371708572
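
The gap between the training score (about 0.999) and the testing score (about 0.39) is the typical signature of overfitting: an unconstrained tree can keep splitting until it memorizes the training rows. Below is a hedged sketch, on synthetic data rather than the housing data, of how limiting `max_depth` changes the two scores. The depth of 3 is an arbitrary choice for illustration, not a tuned value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data for illustration only: one informative feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(0, 3.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scores = {}
for depth in (None, 3):  # None = grow the tree without a depth limit
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Store (training R^2, testing R^2) for each depth setting.
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, scores[depth])
```

Whether a shallower tree helps on the housing data would need to be checked by rerunning the cell above with different `max_depth` values.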


# Logistic Regression

For the logistic regression model and the decision tree classifier, I will be using the lung cancer dataset from Kaggle. Our main goal is to find out whether or not a person will have
lung cancer based on the data provided, so this is a great example of classification.

In [17]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [18]:
df = pd.read_csv("survey_lung_cancer.csv")
df.head()

Unnamed: 0,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER,GENDER
0,69,1,2,2,1,1,2,1,2,2,2,2,2,2,YES,M
1,74,2,1,1,1,2,2,2,1,1,1,2,2,2,YES,M
2,59,1,1,1,2,1,2,1,2,1,2,2,1,2,NO,F
3,63,2,2,2,1,1,1,1,1,2,1,1,2,2,NO,M
4,63,1,2,1,1,1,1,1,2,1,2,2,1,1,NO,F


In [19]:
# clean up the data and make everything numeric: values coded 1/2 become 0/1
df.replace({1: 0, 2: 1}, inplace=True)
df['LUNG_CANCER']= df["LUNG_CANCER"].map({"NO": 0,"YES": 1})
df['GENDER'] = df["GENDER"].map({"M": 0, "F": 1})
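
Calling `df.replace` on the whole frame applies to every column, so an `AGE` of exactly 1 or 2 would also be rewritten. That is unlikely in this survey, but the replacement can be limited to the coded columns. A minimal sketch on a toy frame follows; the rows are made up, and the `AGE` of 2 is included on purpose to show that it stays untouched.

```python
import pandas as pd

# Toy frame for illustration only; an AGE of 2 is deliberately included.
toy = pd.DataFrame({
    "AGE": [69, 2, 59],
    "SMOKING": [1, 2, 1],
    "COUGHING": [2, 2, 1],
})

# Only these columns use the 1/2 coding, so only they are recoded.
coded_columns = ["SMOKING", "COUGHING"]
toy[coded_columns] = toy[coded_columns].replace({1: 0, 2: 1})

print(toy)
```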

In [20]:
df.head()

Unnamed: 0,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,PEER_PRESSURE,CHRONIC DISEASE,FATIGUE,ALLERGY,WHEEZING,ALCOHOL CONSUMING,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER,GENDER
0,69,0,1,1,0,0,1,0,1,1,1,1,1,1,1,0
1,74,1,0,0,0,1,1,1,0,0,0,1,1,1,1,0
2,59,0,0,0,1,0,1,0,1,0,1,1,0,1,0,1
3,63,1,1,1,0,0,0,0,0,1,0,0,1,1,0,0
4,63,0,1,0,0,0,0,0,1,0,1,1,0,0,0,1


In [21]:
X = df.drop('LUNG_CANCER', axis=1)
y = df['LUNG_CANCER']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [22]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Accuracy: 0.967741935483871
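
Accuracy is the share of correct predictions, so it does not show which kind of mistake the model makes. A confusion matrix splits the predictions into true/false positives and negatives. The sketch below uses small made-up label lists (not the model's actual predictions) to show how the two relate:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only (1 = cancer, 0 = no cancer).
y_true_demo = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred_demo = [1, 1, 1, 1, 1, 0, 1, 0]

# For binary labels, rows are true classes and columns are predicted classes:
# [[tn, fp],
#  [fn, tp]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the diagonal (correct predictions) divided by the total.
accuracy_demo = (tp + tn) / cm.sum()

print(cm)
print(accuracy_demo, accuracy_score(y_true_demo, y_pred_demo))
```

The same call on `y_test` and `y_pred` from the cell above would show how the errors of the fitted model are distributed.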


# Decision Tree Classifier

In [23]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [27]:
X = df.drop('LUNG_CANCER', axis=1)
y = df['LUNG_CANCER']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=41)

In [28]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

# Predict on the test data
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)

print("Accuracy:", accuracy)

Accuracy: 0.8064516129032258
