# The first part of the assignment requires you to perform multivariate regression to estimate house prices, using information about the available facilities as features.

**A peek into the dataset :**

 The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:

* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per \$10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in $1000's

MEDV is the dependent variable.


 Using this dataset, explain your understanding of linear regression. You should do some checks on the features and the dependent variable, get some plots and distributions for the given variables.
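
As a starting point for that explanation: linear regression models the target as an intercept plus a weighted sum of the features, and picks the weights that minimise the sum of squared residuals. The sketch below illustrates this on synthetic data (an assumption for illustration, not the Boston data) using NumPy's least-squares solver; all variable names are hypothetical.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 3 + 2*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_intercept, true_coefs = 3.0, np.array([2.0, -1.0])
y = true_intercept + X @ true_coefs + rng.normal(scale=0.1, size=200)

# Add a column of ones so the intercept is estimated together with the weights
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ w ~ y (minimises the sum of squared residuals)
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("intercept:", w[0])
print("coefficients:", w[1:])
```

With little noise, the estimated intercept and coefficients should land close to the values used to generate the data. `LinearRegression` in scikit-learn fits the same kind of model.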

In [None]:
!pip install scikit-learn==1.0.2



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

%matplotlib inline

from sklearn.datasets import load_boston
import seaborn as sns

In [None]:
# Load the dataset by calling the load_boston function
boston = load_boston()


    The Boston housing prices dataset has an ethical problem. You can refer to
    the documentation of this function for further details.

    The scikit-learn maintainers therefore strongly discourage the use of this
    dataset unless the purpose of the code is to study and educate about
    ethical issues in data science and machine learning.

    In this special case, you can fetch the dataset from the original
    source::

        import pandas as pd
        import numpy as np


        data_url = "http://lib.stat.cmu.edu/datasets/boston"
        raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
        data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
        target = raw_df.values[1::2, 2]

    Alternative datasets include the California housing dataset (i.e.
    :func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
    dataset. You can load the datasets as follows::

        from sklearn.datasets import fetch_california_h

In [None]:
# Find the keys of all features in the dataset
boston.feature_names
print(boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [None]:
# Find the type of the data
data=boston.data
# Fill
print(type(data))

<class 'numpy.ndarray'>


In [None]:
data.shape
# 506 is the number of rows (samples) and 13 is the number of columns (features) in the Boston dataset

(506, 13)

In [None]:
# Load the data into a pandas dataframe
data_df = pd.DataFrame(data=data, columns=boston.feature_names)
# Add the target values as an extra column
data_df['target'] = pd.Series(boston.target)

# Print the head of the dataframe to see how it is
# Fill
print(data_df.head())


      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  target  
0     15.3  396.90   4.98    24.0  
1     17.8  396.90   9.14    21.6  
2     17.8  392.83   4.03    34.7  
3     18.7  394.63   2.94    33.4  
4     18.7  396.90   5.33    36.2  


In [None]:
# sklearn datasets have the y value as boston.target. Load the y values as a new column data_df['PRICE']
# Fill
y = boston.target
data_df['PRICE'] = y
data_df.head()
# y holds the target values (median house prices); it is not a training set yet, the train/test split comes later

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,PRICE
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,36.2


In [None]:
# Visualise the data frame using the .describe() function
# Fill
print(data_df.describe())


In [None]:
# Visualise the data frame using the .info() function
# Fill
data_df.info()  # info() prints its summary itself and returns None, so no print() is needed
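
The assignment text asks for some checks on the features and the dependent variable. A minimal sketch of such checks is shown below on a small made-up frame (`demo_df` and the relationship used to generate its `PRICE` column are assumptions for illustration, not properties of the real data); the same calls would apply to `data_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_df: two features and a PRICE column
rng = np.random.default_rng(1)
rm = rng.normal(6.3, 0.7, size=300)
lstat = rng.uniform(2, 35, size=300)
price = 5 * rm - 0.6 * lstat + rng.normal(0, 2, size=300)  # made-up relationship
demo_df = pd.DataFrame({"RM": rm, "LSTAT": lstat, "PRICE": price})

# Check 1: number of missing values per column
missing = demo_df.isna().sum()
print(missing)

# Check 2: linear correlation of each feature with the target
corr_with_price = demo_df.corr()["PRICE"].drop("PRICE").sort_values()
print(corr_with_price)

# Check 3 (in a notebook): demo_df.hist(bins=30) draws one histogram per column
```

Features with a correlation close to zero are not necessarily useless, since correlation only captures linear relationships, but the ranking is a quick first look.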

In [None]:
# Create the train test split using sklearn function
x = data_df.drop(columns = ['PRICE', 'target'])
y = data_df['PRICE']
# Fill
from sklearn.model_selection import train_test_split
# train_test_split returns the splits in the order: x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
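
`train_test_split` returns its outputs in a fixed order: for each input array, the training part first and then the test part. A small sketch on toy arrays (the names `X_demo` and `y_demo` are hypothetical) makes it easy to confirm which output is which by looking at the shapes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays (assumed for illustration): 20 samples, 3 features
X_demo = np.arange(60).reshape(20, 3)
y_demo = np.arange(20)

# The return order is: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

print(X_tr.shape, X_te.shape)  # feature matrices (2-D)
print(y_tr.shape, y_te.shape)  # target vectors (1-D)
```

If the names are unpacked in a different order, a feature matrix ends up where a target vector is expected, and `fit` typically fails with an error about inconsistent numbers of samples.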


In [None]:
# Load the Linear Regression model
# Fill
from sklearn.linear_model import LinearRegression
reg = LinearRegression()


In [None]:
# Fit the loaded model using the train dataset
# Fill

reg.fit(x_train, y_train)
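# An "inconsistent numbers of samples" error here usually means the outputs of train_test_split
# were unpacked in the wrong order (it returns x_train, x_test, y_train, y_test); no data is lost from the file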

In [None]:
# Predict the y values corresponding to test x values
# Fill

y_predicted = reg.predict(x_test)

In [None]:
from sklearn.metrics import mean_squared_error
# Find the mse error using the sklearn function between the test y values and predicted y values corresponding to test x values
# Fill
print(mean_squared_error(y_test, y_predicted))
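
To see what this metric measures, the mean squared error can be computed by hand as the mean of the squared residuals. The sketch below does this on a few toy values (assumed for illustration) and also computes R², for which `r2_score` was imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy values (assumed for illustration)
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_hat = np.array([25.0, 20.6, 33.7, 35.4])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

MSE is in squared units of the target, so its square root is often easier to interpret; R² is unitless and compares the model against always predicting the mean.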


# Second Part : Implementing Logistic Regression

Here you will have to implement a model to predict if a person will buy a product or not given their age.
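
Logistic regression turns a linear score into a probability with the sigmoid function and then thresholds that probability to get a class. The sketch below shows the idea for a single feature (age); the weight and bias are hypothetical values chosen only for illustration, not fitted parameters.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters (not fitted): the score is zero at age = -bias / weight = 40
weight, bias = 0.2, -8.0

ages = np.array([20, 40, 60])
probs = sigmoid(weight * ages + bias)   # estimated P(Purchased = 1 | age)
labels = (probs >= 0.5).astype(int)     # threshold at 0.5 to get a class

print(probs)
print(labels)
```

Fitting the model amounts to choosing the weight and bias so that these probabilities agree as well as possible with the observed labels.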

In [None]:
# Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from math import exp
plt.rcParams["figure.figsize"] = (10, 6)

In [None]:
# Download the dataset
# Source of dataset - https://www.kaggle.com/rakeshrau/social-network-ads
!wget "https://drive.google.com/uc?id=15WAD9_4CpUK6EWmgWVXU8YMnyYLKQvW8&export=download" -O data.csv -q

In [None]:
# Load the dataset using load_csv function and visualize the data

def load_csv(path):
    # Read the CSV file at the given path into a DataFrame
    return pd.read_csv(path, encoding='utf-8')

# data.csv is the file written by the wget command above
df = load_csv("data.csv")
df.head()


In [None]:
# Try plotting a scatter plot of the data
%matplotlib inline
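# Plot the feature used by the model (Age) against the label (Purchased)
plt.scatter(df['Age'], df['Purchased'], color='red')
plt.xlabel('Age')
plt.ylabel('Purchased')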

In [None]:
# Create the train test split
x = df[['Age']]
y = df['Purchased']
# train_test_split returns the splits in the order: x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.15)


In [None]:
# Import the logistic regression model from sklearn and initialize it
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [None]:
# Fit the model using the training samples
logreg.fit(x_train, y_train)

In [None]:
# Predict the y values for test_x values
y_pred = logreg.predict(x_test)

In [None]:
# Calculate the accuracy of the model using sklearn function
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
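
Accuracy is the fraction of predictions that match the true label. The sketch below computes it by hand on toy labels (assumed for illustration) and adds a confusion matrix, which shows where the errors fall.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels (assumed for illustration)
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Accuracy: fraction of predictions that match the true label
acc_manual = np.mean(y_true == y_hat)
print(acc_manual, accuracy_score(y_true, y_hat))

# Confusion matrix: rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)
```

When the two classes are imbalanced, accuracy alone can look good even for a model that always predicts the majority class, so the confusion matrix is a useful companion.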