<h1>Importing Dataset - Laptop Pricing</h1>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression

In [None]:
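file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_base.csv"
# The base file carries no header row (column names are assigned in the next section),
# so read it with header=None to avoid consuming the first record as column names.
df = pd.read_csv(file_path, header=None)
# Keep a local copy without the index or the placeholder integer header.
df.to_csv("laptop.csv", index=False, header=False)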

In [None]:
df.head(10)

<h3>Assigning headers to the dataframe</h3>

In [None]:
headers=["Manufacturer", "Category", "Screen", "GPU", "OS", "CPU_core", "Screen_Size_cm", "CPU_frequency", "RAM_GB", "Storage_GB_SSD", "Weight_kg" ,"Price"]
df.columns=headers

In [None]:
df.head()

In [None]:
#checking data type of each column
df.dtypes

In [None]:
# statistical description of the dataset, including columns of 'object' data type
df.describe(include="all")

<h1>Handling missing data</h1>

In [None]:
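# Note: isnull() only flags real NaN values; "?" placeholders are not counted
# as missing until they are replaced with np.nan.
missing_data = df.isnull()
print(missing_data.head())
for column in missing_data.columns:
    print(column)
    print(missing_data[column].value_counts())
    print("")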

<h3>Replacing "?" with np.nan</h3>

In [None]:
df.replace("?",np.nan,inplace=True)

In [None]:
df.head()

In [None]:
# replacing the missing values of weight with the average value of the attribute.
avg_weight = df['Weight_kg'].astype('float').mean(axis=0)
# Assign the result back; an in-place replace on a selected column may not update df.
df["Weight_kg"] = df["Weight_kg"].replace(np.nan, avg_weight)

In [None]:
# replacing the missing values of Screen Size with the most frequent value of the attribute.
common_screen_size = df['Screen_Size_cm'].value_counts().idxmax()
# Assign the result back; an in-place replace on a selected column may not update df.
df["Screen_Size_cm"] = df["Screen_Size_cm"].replace(np.nan, common_screen_size)

In [None]:
df.head()

Both "Weight_kg" and "Screen_Size_cm" have the data type "object", although both hold numeric values and should be of type "float".

In [None]:
df[["Weight_kg","Screen_Size_cm"]] = df[["Weight_kg","Screen_Size_cm"]].astype("float")
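
<p>The cleaning steps above (placeholder to NaN, imputation, type conversion) can be sketched on a toy column. The values below are hypothetical, and this sketch casts to float before imputing, which is a slightly different order from the cells above but should give the same result.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: "?" marks a missing entry, numbers arrive as strings.
toy = pd.DataFrame({"Weight_kg": ["1.5", "?", "2.5"]})

# Step 1: turn the placeholder into a real missing value.
toy = toy.replace("?", np.nan)

# Step 2: cast to float so numeric operations work (NaN is a valid float).
toy["Weight_kg"] = toy["Weight_kg"].astype("float")

# Step 3: impute the gap with the column mean (mean() skips NaN by default).
toy["Weight_kg"] = toy["Weight_kg"].fillna(toy["Weight_kg"].mean())

print(toy["Weight_kg"].tolist())
```

<p>Casting first keeps the column numeric throughout, so the mean and the filled value share one dtype.</p>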

<h3>Data Standardization</h3>

In [None]:
# Data standardization: convert weight from kg to pounds
df["Weight_kg"] = df["Weight_kg"]*2.205
df.rename(columns={'Weight_kg':'Weight_pounds'}, inplace=True)

# Data standardization: convert screen size from cm to inch
df["Screen_Size_cm"] = df["Screen_Size_cm"]/2.54
df.rename(columns={'Screen_Size_cm':'Screen_Size_inch'}, inplace=True)

<h3>Data Normalization</h3>

In [None]:
df['CPU_frequency'] = df['CPU_frequency']/df['CPU_frequency'].max()
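
<p>The cell above uses simple feature scaling (dividing by the maximum), which maps the largest value to 1. A minimal sketch on hypothetical values, with min-max scaling shown only for comparison:</p>

```python
import pandas as pd

# Hypothetical CPU frequencies in GHz.
freq = pd.Series([1.0, 2.0, 4.0])

# Simple feature scaling, as used in the notebook: x / max(x).
simple_scaled = freq / freq.max()

# Min-max scaling (an alternative, not used above): (x - min) / (max - min).
minmax_scaled = (freq - freq.min()) / (freq.max() - freq.min())

print(simple_scaled.tolist())
print(minmax_scaled.tolist())
```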

In [None]:
df.head()

<h3>Binning</h3>
<p>Creating 3 bins for the attribute "Price" named "Low", "Medium" and "High". The new attribute will be named "Price-binned".</p>

In [None]:
bins = np.linspace(min(df["Price"]), max(df["Price"]), 4)
group_names = ['Low', 'Medium', 'High']
df['Price-binned'] = pd.cut(df['Price'], bins, labels=group_names, include_lowest=True )
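
<p>To see how the bin edges work, here is a small sketch on hypothetical prices. Four equally spaced edges from <code>np.linspace</code> define three equal-width bins, and <code>include_lowest=True</code> keeps the minimum value inside the first bin.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical prices, chosen so the edges are round numbers.
prices = pd.Series([500, 1000, 1500, 2000, 3500])

# Four edges -> three bins of equal width.
edges = np.linspace(prices.min(), prices.max(), 4)
labels = ["Low", "Medium", "High"]

# Bins are right-inclusive: a value equal to an edge falls in the lower bin.
binned = pd.cut(prices, edges, labels=labels, include_lowest=True)

print(edges)
print(binned.tolist())
```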

In [None]:
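# value_counts() sorts by frequency, so reindex to align the counts with group_names.
bin_counts = df["Price-binned"].value_counts().reindex(group_names)
plt.bar(group_names, bin_counts)
plt.xlabel("Price")
plt.ylabel("count")
plt.title("Price bins")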

<p>Convert the "Screen" attribute into two indicator variables, "Screen-IPS_panel" and "Screen-Full_HD", then drop the original "Screen" attribute from the dataset.</p>

In [None]:
dummy_variable_1 = pd.get_dummies(df["Screen"])
dummy_variable_1.rename(columns={'IPS Panel':'Screen-IPS_panel', 'Full HD':'Screen-Full_HD'}, inplace=True)
df = pd.concat([df, dummy_variable_1], axis=1)

# drop original column "Screen" from "df"
df.drop("Screen", axis = 1, inplace=True)
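
<p>The same indicator-variable step on a tiny, hypothetical frame shows what the resulting columns look like. The <code>dtype=int</code> argument is an addition here so the output prints as 0/1 rather than booleans.</p>

```python
import pandas as pd

# Hypothetical rows; only the "Screen" labels match the real dataset.
toy = pd.DataFrame({
    "Screen": ["IPS Panel", "Full HD", "IPS Panel"],
    "Price": [900, 700, 1200],
})

# One 0/1 column per distinct label.
dummies = pd.get_dummies(toy["Screen"], dtype=int)
dummies = dummies.rename(
    columns={"IPS Panel": "Screen-IPS_panel", "Full HD": "Screen-Full_HD"}
)

# Attach the indicators and drop the original categorical column.
toy = pd.concat([toy.drop(columns="Screen"), dummies], axis=1)
print(toy)
```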

In [None]:
df.head(10)

<h1>Exploratory Data Analysis</h1>

<h4>Generating regression plots for each of the parameters "CPU_frequency", "Screen_Size_inch" and "Weight_pounds" against "Price" and calculating the value of correlation of each feature with "Price".</h4>

In [None]:
# CPU_frequency plot
sns.regplot(x="CPU_frequency", y="Price", data=df)
plt.ylim(0,)
print("Correlation of Price and CPU_frequency")
df[["CPU_frequency","Price"]].corr()

In [None]:
# Screen_Size_inch plot
sns.regplot(x="Screen_Size_inch", y="Price", data=df)
plt.ylim(0,)
print("Correlation of Price and Screen_Size_inch")
df[["Screen_Size_inch","Price"]].corr()

In [None]:
# Weight_pounds plot
sns.regplot(x="Weight_pounds", y="Price", data=df)
plt.ylim(0,)
print("Correlation of Price and Weight_pounds")
df[["Weight_pounds","Price"]].corr()

<p>Observation: "CPU_frequency" has a positive correlation of about 0.36 with laptop price. The other two parameters are only weakly correlated with price.</p>

<h4>Generating box plots for the features that hold categorical values: "Category", "GPU", "OS", "CPU_core", "RAM_GB" and "Storage_GB_SSD".</h4>

In [None]:
sns.boxplot(x="Category", y="Price", data=df)

In [None]:
sns.boxplot(x="GPU", y="Price", data=df)

In [None]:
sns.boxplot(x="OS", y="Price", data=df)

In [None]:
sns.boxplot(x="CPU_core", y="Price", data=df)

In [None]:
sns.boxplot(x="RAM_GB", y="Price", data=df)

In [None]:
sns.boxplot(x="Storage_GB_SSD", y="Price", data=df)

<h4>Evaluate the Pearson correlation coefficient and p-value for each parameter tested above. This helps identify the parameters most likely to have a strong effect on the price of the laptops.</h4>

In [None]:
from scipy import stats

params = ['RAM_GB', 'CPU_frequency', 'Storage_GB_SSD', 'Screen_Size_inch',
          'Weight_pounds', 'CPU_core', 'OS', 'GPU', 'Category']
for param in params:
    # pearsonr returns the correlation coefficient and its two-sided p-value
    pearson_coef, p_value = stats.pearsonr(df[param], df['Price'])
    print(f"Pearson Correlation Coefficient for {param} is {pearson_coef} with a P-value of P = {p_value}")

<p>Based on the Pearson correlation coefficients and their associated p-values, the parameters most likely to have a strong effect on the price of laptops are RAM_GB, CPU_frequency, CPU_core, GPU, and Category.</p>
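
<p>As a reference point for reading these numbers, the sketch below runs <code>stats.pearsonr</code> on synthetic data (not the laptop dataset): an exactly linear relationship and pure random noise. A coefficient near ±1 with a very small p-value suggests a strong linear relationship, while a coefficient near 0 suggests little linear association.</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

x = np.arange(50, dtype=float)
y_linear = 3 * x + 5            # exact linear dependence on x
y_noise = rng.normal(size=50)   # generated independently of x

r_lin, p_lin = stats.pearsonr(x, y_linear)
r_noise, p_noise = stats.pearsonr(x, y_noise)

print("linear:", r_lin, p_lin)
print("noise: ", r_noise, p_noise)
```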

<h1>Model Development</h1>

<h3>Simple Linear Regression</h3>

In [None]:

lm = LinearRegression()

X = df[['CPU_frequency']]
Y = df['Price']

lm.fit(X,Y)

Yhat=lm.predict(X)

<p>Generate distribution plots of the predicted values and the actual values.</p>

In [None]:
# sns.distplot is deprecated; kdeplot draws the same density curve.
ax1 = sns.kdeplot(x=df['Price'], color="r", label="Actual Value")

# Create a distribution plot for predicted values
sns.kdeplot(x=Yhat, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price')
plt.ylabel('Proportion of laptops')
plt.legend()  # uses the labels set on each curve
plt.show()

<p>Evaluating the Mean Squared Error and R^2 score values for the model.</p>

In [None]:

mse_slr = mean_squared_error(df['Price'], Yhat)
r2_score_slr = lm.score(X, Y)
print('The R-square for Linear Regression is: ', r2_score_slr)
print('The mean square error of price and predicted value is: ', mse_slr)
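
<p>For reference, both metrics can be computed by hand. The sketch below uses hypothetical values and checks the manual formulas, MSE = mean((y − ŷ)²) and R² = 1 − SS_res / SS_tot, against the scikit-learn functions imported at the top of the notebook.</p>

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R^2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```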

<h3>Multiple Linear Regression</h3>

In [None]:
lm1 = LinearRegression()
Z = df[['CPU_frequency','RAM_GB','Storage_GB_SSD','CPU_core','OS','GPU','Category']]
lm1.fit(Z,Y)
Y_predict_multifit = lm1.predict(Z)

<p>Plot the distribution of the predicted values alongside that of the actual values.</p>

In [None]:
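# sns.distplot is deprecated; kdeplot draws the same density curve.
ax1 = sns.kdeplot(x=df['Price'], color="r", label="Actual Value")
# Use the multiple-regression predictions computed above.
sns.kdeplot(x=Y_predict_multifit, color="b", label="Fitted Values", ax=ax1)

plt.title('Actual vs Fitted Values for Price')
plt.xlabel('Price')
plt.ylabel('Proportion of laptops')
plt.legend()
plt.show()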

<p>Evaluating the Mean Squared Error and R^2 score values for the model.</p>

In [None]:
# Use distinct names so the simple-regression metrics are not overwritten.
mse_mlr = mean_squared_error(df['Price'], Y_predict_multifit)
r2_score_mlr = lm1.score(Z, Y)
print('The R-square for Multiple Linear Regression is: ', r2_score_mlr)
print('The mean square error of price and predicted value is: ', mse_mlr)

<h3>Polynomial Regression</h3>


In [None]:
X = X.to_numpy().flatten()
f1 = np.polyfit(X, Y, 1)
p1 = np.poly1d(f1)

f3 = np.polyfit(X, Y, 3)
p3 = np.poly1d(f3)

f5 = np.polyfit(X, Y, 5)
p5 = np.poly1d(f5)
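
<p>A quick sanity check of how <code>np.polyfit</code> and <code>np.poly1d</code> behave, using synthetic noise-free data rather than the laptop dataset. The coefficients come back ordered from the highest degree down, and the <code>poly1d</code> object can be called like a function.</p>

```python
import numpy as np

# Synthetic data generated from a known quadratic: y = 2x^2 - 3x + 1.
x = np.linspace(-2, 2, 21)
y = 2 * x**2 - 3 * x + 1

# Fit a degree-2 polynomial; coefficients are returned highest degree first.
coeffs = np.polyfit(x, y, 2)
p = np.poly1d(coeffs)

print(coeffs)   # should be close to [2, -3, 1]
print(p(3.0))   # evaluate the fitted polynomial at x = 3
```

<p>On real, noisy data a higher degree lowers the training error, so a better fit on the same data does not by itself mean a better model.</p>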

<p>Plot the regression output against the actual data points to note how the data fits in each case. </p>

In [None]:
def PlotPolly(model, independent_variable, dependent_variable, Name):
    # Evaluate the fitted polynomial on an evenly spaced grid for a smooth curve.
    x_new = np.linspace(independent_variable.min(), independent_variable.max(), 100)
    y_new = model(x_new)

    # Raw data as points, fitted polynomial as a line.
    plt.plot(independent_variable, dependent_variable, '.', x_new, y_new, '-')
    plt.title(f'Polynomial Fit for Price ~ {Name}')
    ax = plt.gca()
    ax.set_facecolor((0.898, 0.898, 0.898))
    plt.xlabel(Name)
    plt.ylabel('Price of laptops')

<p>Call this function for each of the three models to produce the corresponding plots.</p>

In [None]:
PlotPolly(p1, X, Y, 'CPU_frequency')

In [None]:
PlotPolly(p3, X, Y, 'CPU_frequency')

In [None]:
PlotPolly(p5, X, Y, 'CPU_frequency')

<p>Calculate the R^2 and MSE values for these fits. </p>

In [None]:
r_squared_1 = r2_score(Y, p1(X))
print('The R-square value for 1st degree polynomial is: ', r_squared_1)
print('The MSE value for 1st degree polynomial is: ', mean_squared_error(Y,p1(X)))
r_squared_3 = r2_score(Y, p3(X))
print('The R-square value for 3rd degree polynomial is: ', r_squared_3)
print('The MSE value for 3rd degree polynomial is: ', mean_squared_error(Y,p3(X)))
r_squared_5 = r2_score(Y, p5(X))
print('The R-square value for 5th degree polynomial is: ', r_squared_5)
print('The MSE value for 5th degree polynomial is: ', mean_squared_error(Y,p5(X)))

<p>Conclusion: Based on the R-squared and MSE values, the Multiple Linear Regression model appears to be the best fit among the models tested. It has the highest R-squared value, meaning it explains the most variance in price, and the lowest MSE, meaning its predictions deviate least from the actual prices. Note that all of these metrics are computed on the same data used to fit the models.</p>