
1.1.1. Difference between Data Science and Data Mining

Data Science:
Data Science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It encompasses various techniques from statistics, data analysis, machine learning, and domain knowledge to analyze and interpret complex data.

Data Mining:
Data Mining is a subset of Data Science that focuses on discovering patterns, correlations, and anomalies in large datasets. It involves methods at the intersection of machine learning, statistics, and database systems to extract useful information from data.

1.1.2. Difference between AI, Machine Learning, and Deep Learning

Artificial Intelligence (AI):
AI is a broad field of computer science focused on creating systems capable of performing tasks that typically require human intelligence. These tasks include reasoning, learning, problem-solving, perception, and language understanding.

Machine Learning (ML):
ML is a subset of AI that involves the use of algorithms and statistical models to enable computers to improve their performance on a specific task through experience. ML algorithms learn patterns from data and make predictions or decisions without being explicitly programmed.

Deep Learning (DL):
DL is a specialized subset of ML that uses neural networks with many layers (deep neural networks) to model complex patterns in data. DL excels at tasks such as image recognition, speech recognition, and natural language processing because it learns hierarchical representations of the data.

1.1.3. Difference between Supervised and Unsupervised Learning

Supervised Learning:
Supervised learning involves training a model on labeled data, where the input data is paired with the correct output. The model learns to make predictions or decisions based on this training data. An example is a spam email classifier that is trained on emails labeled as "spam" or "not spam."

Unsupervised Learning:
Unsupervised learning involves training a model on unlabeled data, where the algorithm tries to identify patterns and relationships within the data without predefined labels. An example is customer segmentation, where the goal is to group customers based on purchasing behavior without prior knowledge of the group labels.
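
The contrast can be sketched on a toy example. Everything below is an illustrative assumption rather than part of the assignment: the one-feature data, the nearest-centroid classifier standing in for a supervised model, and the hand-written two-cluster k-means loop standing in for an unsupervised one.

```python
import numpy as np

# Hypothetical toy data: one numeric feature per sample
# (for example, a count of suspicious words in an email).
X = np.array([0.0, 1.0, 2.0, 8.0, 9.0, 10.0])

# --- Supervised: labels are provided (0 = not spam, 1 = spam) ---
y = np.array([0, 0, 0, 1, 1, 1])

def fit_centroids(X, y):
    """Learn one mean feature value per labeled class."""
    return {int(label): X[y == label].mean() for label in np.unique(y)}

def predict(centroids, x):
    """Assign x to the class whose mean is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))

centroids = fit_centroids(X, y)
supervised_pred = predict(centroids, 7.0)  # uses what was learned from labels

# --- Unsupervised: no labels; group the samples into two clusters ---
def two_means(X, n_iter=10):
    """Minimal two-cluster k-means for 1-D data (assumes neither cluster empties)."""
    centers = np.array([X.min(), X.max()], dtype=float)
    for _ in range(n_iter):
        # Assign each sample to its nearest center, then recompute the centers.
        assign = np.abs(X[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([X[assign == k].mean() for k in range(2)])
    return assign, centers

clusters, centers = two_means(X)

print("supervised prediction for 7.0:", supervised_pred)
print("cluster assignments:", clusters.tolist())
```

The supervised part needs `y` to learn anything, while the clustering loop only ever sees `X` and produces group indices that carry no predefined meaning.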

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Load dataset
data_url = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"
data = pd.read_csv(data_url)

# Exploratory Data Analysis
print(data.head())
print(data.describe())
data.info()  # info() prints its report directly and returns None

# Feature engineering: adding new features
data['rooms_per_household'] = data['total_rooms'] / data['households']
data['bedrooms_per_room'] = data['total_bedrooms'] / data['total_rooms']
data['population_per_household'] = data['population'] / data['households']

# Prepare the data for Machine Learning algorithms
X = data.drop("median_house_value", axis=1)
y = data["median_house_value"].copy()

# Data preprocessing: handling missing values
imputer = SimpleImputer(strategy="median")
X_num = X.select_dtypes(include=[np.number])
# Keep the original index so the later concat aligns rows correctly
X_num_imputed = pd.DataFrame(imputer.fit_transform(X_num), columns=X_num.columns, index=X_num.index)

# Convert categorical attribute "ocean_proximity" to one-hot vectors
X_cat = pd.get_dummies(X["ocean_proximity"], drop_first=True)

# Combine numerical and categorical features
X_prepared = pd.concat([X_num_imputed, X_cat], axis=1)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_prepared, y, test_size=0.2, random_state=42)

# Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  

NumPy, pandas, and scikit-learn handle data manipulation, model training, and evaluation. Matplotlib and seaborn are imported for visualization, although this cell does not draw any plots.
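The dataset is loaded from a URL into a pandas DataFrame. Each row describes one housing district: its location (longitude, latitude, ocean_proximity), housing_median_age, total_rooms, total_bedrooms, population, households, median_income, and the target median_house_value.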
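EDA starts by displaying the first few rows (head()), a statistical summary (describe()), and the data types with non-null counts (info()). Comparing the non-null counts against the total number of rows shows which columns have missing values, and the summary statistics help flag anomalies.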
New features are derived from existing columns so the model can capture more relevant information: 'rooms_per_household', 'bedrooms_per_room', and 'population_per_household'.
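The target variable 'median_house_value' is separated from the features. Missing values in the numerical features are replaced with each column's median using SimpleImputer. Note that the imputer is fit on the full dataset before the split; fitting it on the training set only would avoid leaking test-set information into the medians.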
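The 'ocean_proximity' categorical attribute is converted to one-hot encoded columns so the model can use it. With drop_first=True, the first category is dropped and serves as the baseline, which avoids a redundant column.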
Numerical and categorical features are combined into a single DataFrame for model training.
The dataset is split into training and testing sets to evaluate the model's performance on unseen data.
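Features are scaled to have zero mean and unit variance. The scaler is fit on the training set only and then applied to the test set, so no test-set statistics influence training. Scaling matters most for regularized or gradient-based models; for plain least-squares linear regression it does not change the predictions, but it puts the coefficients on a comparable scale.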
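
As a minimal sketch of what standardization computes, the snippet below reproduces the fit-on-train, apply-to-test pattern with plain NumPy. The small arrays are made-up illustration values, not rows from the housing data; the population standard deviation (ddof=0) is used because that is what StandardScaler uses.

```python
import numpy as np

# Hypothetical toy features: two columns on very different scales.
X_train = np.array([[1.0, 100.0],
                    [2.0, 200.0],
                    [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# "Fit": learn per-column mean and standard deviation from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # ddof=0, the population standard deviation

# "Transform": apply the training statistics to both sets.
X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_train_scaled.std(axis=0))   # approximately 1 for each column
print(X_test_scaled)
```

The test row is not guaranteed to fall within the training range after scaling, since it is transformed with statistics it did not contribute to.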
A linear regression model is trained on the scaled training data.
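The trained model is used to make predictions on the scaled test set. The root mean squared error (RMSE), the square root of the average squared difference between predicted and actual values, measures the model's performance. RMSE is expressed in the same units as the target (here, house value), and a lower RMSE indicates a better fit.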
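
The RMSE calculation can be written out by hand to show what the metric measures. This is a sketch with made-up target values; the `rmse` helper is a hypothetical name introduced here, equivalent to taking the square root of `mean_squared_error`.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean of squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical house values and predictions, for illustration only.
y_true = [200000.0, 300000.0, 400000.0]
y_pred = [210000.0, 290000.0, 430000.0]

# Residuals are 10000, -10000, and 30000; squaring lets the largest error dominate.
print(f"RMSE: {rmse(y_true, y_pred):.2f}")
```

Because the errors are squared before averaging, one large miss raises the RMSE more than several small ones, which is why the result here (about 19,149) sits closer to the 30,000 miss than to the two 10,000 misses.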