# Scenario 

The Google Play Store is the hub for downloading apps onto a device, whether it is a laptop or a mobile phone. Each app's page on the Play Store shows information such as reviews, ratings, and the type of app.

### Objective
Using different ML algorithms, predict the total number of downloads (installs) of a particular app on the Google Play Store based on its features.

Dataset: 10841 rows, 13 columns


### Questions to answer

1. Identify the types of models you would prefer to build for this project.
2. Check whether the dataset contains null values. If it does, impute them.
3. Check for outliers. Use a box plot as well as a suitable mathematical method to detect them. If outliers are present in particular features, decide whether to treat them; if so, apply the treatment.
4. Check for multicollinearity. If it is present, apply the necessary treatment.
5. Does successful model building require scaling? If yes, explain how you handle it.
6. Prepare at least 4 models for this problem statement.
7. Evaluate your models and select one based on different evaluation parameters. State the significance of each parameter.
8. Identify the features you think are most needed for good install prediction.
9. Predict installs for at least 10 data points.
10. Mention the business scope of this project.

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np 
import pandas as pd 
import seaborn as sns
import os 

In [None]:
df=pd.read_csv('/kaggle/input/google-play-store-installation-prediction/googleplaystore.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.isna().sum()

### Null values imputation in Rating column


In [None]:
df['Rating']

In [None]:
# Replace NaN values with the mean of the Rating column
mean_rating = df['Rating'].mean()
df['Rating'] = df['Rating'].fillna(mean_rating)

In [None]:
df['Rating'].isnull().sum()

### Convert each predictor (independent variable) to int/float

### Begin with 'Reviews' column

In [None]:
# Reviews column cardinality check
df['Reviews'].nunique()

In [None]:
df[df['Reviews'].str.endswith('M')]

In [None]:
# This row's values are shifted into the wrong columns, so remove the entire row.
df = df.drop(10472)

In [None]:
df['Reviews'] = df['Reviews'].astype(float)

In [None]:
df.info()

### Convert 'Size' column

In [None]:
df.head() 

In [None]:
df.tail()

In [None]:
# Remove 'M' from the 'Size' column entries
df['Size']= df['Size'].str.replace('M', '')

In [None]:
# We need to convert 'Varies with device'. First calculate total entries in Size column with 'Varies with device'

count_varies = len(df[df['Size'] == 'Varies with device'])
print("Total 'Varies with device' :", count_varies)

Total 'Varies with device' rows: 1695. This is a significant share of the 10840 rows, so these rows cannot simply be removed.

In [None]:
# Some entries carry the suffix 'k' (kilobytes), while most sizes are given in megabytes

In [None]:
# Remove 'k' and multiply by 0.001 for entries suffixed with 'k'
df['Size'] = df['Size'].apply(lambda x: float(x.replace('k', ''))*0.001 if 'k' in x else x)

In [None]:
#Check total entries containing 'k' as suffix in Size 
df['Size'].str.contains('k').sum()

In [None]:
# Replace 'Varies with device' with null values
df['Size'] = df['Size'].replace('Varies with device', np.nan)

In [None]:
df['Size']

In [None]:
df['Size'] = df['Size'].astype(float)
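
The `Size` cleanup above is spread over several cells. As a rough sketch, the same logic can be collected into one helper. The function name `size_to_mb` and the toy values are assumptions made for illustration, not part of the original notebook; the kilobyte factor of 0.001 mirrors the one used above.

```python
import numpy as np
import pandas as pd

def size_to_mb(value):
    """Hypothetical helper: convert a size string to megabytes (float).

    Assumed input formats: '19M', '201k', 'Varies with device'.
    """
    if not isinstance(value, str):
        return np.nan
    value = value.strip()
    if value.endswith('M'):
        return float(value[:-1])
    if value.endswith('k'):
        return float(value[:-1]) * 0.001  # same k -> M factor as the cell above
    return np.nan  # 'Varies with device' and anything unrecognised

# Toy values standing in for df['Size']
sizes = pd.Series(['19M', '201k', 'Varies with device', '8.7M'])
sizes_mb = sizes.apply(size_to_mb)
print(sizes_mb)
```

On the real data this would be applied as `df['Size'].apply(size_to_mb)`, leaving NaN where the size varies with the device so that it can be imputed afterwards.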

In [None]:
#Replace Nan with mean in Size column
mean_size = df['Size'].mean()
df['Size'] = df['Size'].fillna(mean_size)

In [None]:
df['Size'].isnull().sum()

In [None]:
df.info()

### Now begin with 'Installs' column

In [None]:
df.head()

In [None]:
# Remove the '+' suffix from all entries of the Installs column (literal match, not a regex)
df['Installs'] = df['Installs'].str.replace('+', '', regex=False)

In [None]:
df['Installs'] = df['Installs'].str.replace(',', '')

In [None]:
# Now try converting the datatype to float
df['Installs'] = df['Installs'].astype(float)

In [None]:
df.info()

### Now begin with 'Price' column

In [None]:
df['Price'].unique()

In [None]:
# Remove the $ sign (literal match, not a regex)
df['Price'] = df['Price'].str.replace('$', '', regex=False)

In [None]:
# Convert datatype
df['Price'] = df['Price'].astype(float)
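
Both `+` and `$` are special characters in regular expressions, and the default of the `regex` argument of `Series.str.replace` has changed between pandas versions. A minimal sketch on toy strings (the values are made up for illustration) shows the cleanup with `regex=False`, which treats the pattern as a literal string:

```python
import pandas as pd

# Toy values standing in for df['Installs'] and df['Price']
installs = pd.Series(['10,000+', '500,000+', '1,000,000,000+'])
price = pd.Series(['0', '$4.99', '$399.99'])

# regex=False makes '+' and '$' plain characters rather than regex operators
installs_num = (
    installs.str.replace('+', '', regex=False)
            .str.replace(',', '', regex=False)
            .astype(float)
)
price_num = price.str.replace('$', '', regex=False).astype(float)

print(installs_num.tolist())
print(price_num.tolist())
```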

In [None]:
df.info()

In [None]:
df.isnull().sum()

### Now analyse 'Content Rating' column

In [None]:
df['Content Rating'].unique()

### Check distribution of numeric columns

In [None]:
df.describe()

In [None]:
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_columns

In [None]:
# Plot histograms of the numeric columns
df[numeric_columns].hist(bins=20, figsize=(10, 10))
plt.show()


### Plot boxplots to check outliers

In [None]:
# Rating

df.boxplot(column='Rating', vert=False)
plt.title('Box Plot of Rating')
plt.xlabel('Rating')

plt.show()

In [None]:
# Reviews
df.boxplot(column='Reviews', vert=False)
plt.title('Box Plot of Reviews')
plt.xlabel('Reviews')
plt.show()

In [None]:
# Size
df.boxplot(column='Size', vert=False)
plt.title('Box Plot of Size')
plt.xlabel('Size')
plt.show()

In [None]:
# Price
df.boxplot(column='Price', vert=False)
plt.title('Box Plot of Price')
plt.xlabel('Price')
plt.show()

### Independent Variables: 'Rating', 'Reviews', 'Size', 'Price'

In [None]:
df.describe()

Since the minimum and maximum of *Rating* lie within the valid 1-5 range, it has no outliers to treat.
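
Question 3 also asks for a mathematical method alongside the box plots. One common choice is the 1.5 x IQR rule, which is the same rule a box plot uses for its whiskers. The sketch below is only an illustration: the helper name `iqr_outlier_mask` and the toy numbers are assumptions, not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Hypothetical helper: True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (series < lower) | (series > upper)

# Toy values standing in for a skewed column such as df['Reviews']
reviews = pd.Series([10, 12, 15, 18, 20, 22, 25, 5000])
mask = iqr_outlier_mask(reviews)
print(reviews[mask].tolist())  # -> [5000]
```

Applied to the notebook's columns, `iqr_outlier_mask(df['Reviews']).sum()` would give the number of flagged rows, which helps when deciding whether to cap, transform, or keep them.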

### Calculate Correlation Matrix

In [None]:
# Calculate the correlation matrix
corr_matrix = df[['Rating', 'Reviews', 'Size', 'Price']].corr()

# Print the correlation matrix
corr_matrix

In [None]:
#Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

##### Since the correlation values are very low, none of these predictors is strongly correlated with another.
Hence multicollinearity does not exist here.
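
A heatmap only shows pairwise correlations. The variance inflation factor (VIF) is a common complementary check, because it also picks up a predictor that is explained by a combination of the others. For standardized predictors the VIFs are the diagonal of the inverse correlation matrix, which the sketch below uses. The helper name `vif_from_corr` and the toy columns are assumptions for illustration; values above roughly 5 to 10 are often treated as a warning sign.

```python
import numpy as np
import pandas as pd

def vif_from_corr(frame):
    """Hypothetical helper: VIF per column from the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    return pd.Series(np.diag(np.linalg.inv(corr)), index=frame.columns)

# Toy data: 'c' is nearly a copy of 'a', 'b' is independent
rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
toy = pd.DataFrame({'a': a, 'b': b, 'c': a + 0.1 * rng.normal(size=n)})

vif = vif_from_corr(toy)
print(vif.round(2))
```

On the notebook's data the call would be `vif_from_corr(df[['Rating', 'Reviews', 'Size', 'Price']])`.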

## Drop unused columns from the DataFrame

In [None]:
df.head(5)

In [None]:
df['Content Rating'].value_counts()


In [None]:
df['Genres'].value_counts()


In [None]:
df['Category'].value_counts()

### Drop the Genres, Last Updated, Current Ver, Android Ver and Category columns

In [None]:
df1= df.drop([ 'Genres', 'Last Updated', 'Current Ver', 'Android Ver', 'Category'], axis=1)


In [None]:
df1

### Dealing with Categorical Columns

In [None]:
df1['Content Rating'].value_counts()

# Exploratory Data Analysis

In [None]:
df.select_dtypes(include=['object']).columns

In [None]:
df['App'].nunique()

In [None]:
# plot the frequency of each category of 'Category' column
fig = plt.figure(figsize=(15, 6))

df['Category'].value_counts().plot(kind='bar')
plt.title('Frequency of Categories')
plt.xlabel('Category')
plt.ylabel('Frequency')
plt.show()

In [None]:
# plot the frequency of each category of 'Type' column

df['Type'].value_counts().plot(kind='bar')
plt.title('Frequency of Type')
plt.xlabel('Type')
plt.ylabel('Frequency')
plt.show()

In [None]:
# plot the frequency of each category of 'Genres' column
fig = plt.figure(figsize=(25, 9))

df['Genres'].value_counts().plot(kind='bar')
plt.title('Frequency of Genres')
plt.xlabel('Genres')
plt.ylabel('Frequency')
plt.show()

In [None]:
# plot the frequency of each category of 'Content Rating' column
df['Content Rating'].value_counts().plot(kind='bar')
plt.title('Frequency of Content Rating')
plt.xlabel('Content Rating')
plt.ylabel('Frequency')
plt.show()

In [None]:
df1.info()

## One-hot encoding of categorical columns


In [None]:
df_encoded = pd.get_dummies(df1, columns=['Content Rating', 'Type'])
df_encoded.head()
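
Keeping every dummy column means the columns of each encoded variable sum to 1, which is perfectly collinear with the intercept of a linear model. One common option, shown here on made-up rows rather than the real data, is `drop_first=True`, which drops one level per variable. Whether that is worthwhile for this project is a judgment call; tree-based models are not affected by it.

```python
import pandas as pd

# Toy rows standing in for df1
toy = pd.DataFrame({
    'Reviews': [10.0, 250.0, 3.0, 90.0],
    'Type': ['Free', 'Paid', 'Free', 'Free'],
    'Content Rating': ['Everyone', 'Teen', 'Everyone', 'Mature 17+'],
})

# drop_first=True removes the first level of each encoded column
encoded = pd.get_dummies(toy, columns=['Content Rating', 'Type'], drop_first=True)
print(encoded.columns.tolist())
```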

### Perform feature scaling of the Reviews and Size columns

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaler.fit(df_encoded[['Reviews', 'Size']])

# Scale the Reviews and Size columns in the dataframe
df_encoded[['Reviews', 'Size']] = scaler.transform(df_encoded[['Reviews', 'Size']])
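
The scaler above is fitted on the full dataset before the train/test split, so statistics from the test rows leak into the scaling. A possible alternative is to split first and put the scaler inside a scikit-learn `Pipeline`, so it is fitted on the training rows only. The sketch below uses synthetic data; the column values, the target formula, and the names `X_tr`, `X_te`, `preprocess` and `model` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned predictors and the Installs target
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    'Rating': rng.uniform(1, 5, n),
    'Reviews': rng.lognormal(8, 2, n),
    'Size': rng.uniform(1, 100, n),
    'Price': rng.choice([0.0, 0.99, 4.99], n),
})
target = 50 * toy['Reviews'] + rng.normal(0, 1000, n)

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.2, random_state=42)

# Scale only Reviews and Size; pass the remaining columns through unchanged
scale_cols = ['Reviews', 'Size']
preprocess = ColumnTransformer(
    [('scale', StandardScaler(), scale_cols)],
    remainder='passthrough',
)
model = Pipeline([('prep', preprocess), ('lr', LinearRegression())])

# fit() learns the scaling statistics from the training rows only
model.fit(X_tr, y_tr)
print('Test R^2:', model.score(X_te, y_te))
```

Scaling matters mainly for distance-based and regularized models such as Lasso and Ridge; decision trees and random forests are insensitive to it.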

In [None]:
df_encoded.head()

In [None]:
df_encoded.describe()

# 1. Linear Regression

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df_encoded.drop(['App', 'Installs'], axis=1) # predictor variables
y = df_encoded['Installs'] # target variable


In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict the target variable using the test data
y_pred = lr.predict(X_test)

In [None]:
# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean squared error:", mse)
print("R-squared value:", r2)

In [None]:
# Print the coefficients
coefficients = pd.DataFrame({'Variable': X.columns, 'Coefficient': lr.coef_})
print(coefficients)

In [None]:
# Observe the errors
errors = y_test - y_pred

# Create a kernel density plot of the errors
# (set the font scale and figure size before plotting so they take effect)
sns.set(font_scale=1.5)
plt.figure(figsize=(20, 4))
sns.kdeplot(errors)
plt.xlabel('Error')
plt.ylabel('Density')
plt.title('Distribution of Errors')
plt.show()

# 2. Polynomial Regression

In [None]:
from sklearn.preprocessing import PolynomialFeatures

In [None]:
X = df_encoded.drop(['App', 'Installs'], axis=1) # predictor variables
y = df_encoded['Installs'] # target variable

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

poly_model = LinearRegression()


In [None]:
poly_model.fit(X_train_poly, y_train)

In [None]:
y_pred_poly = poly_model.predict(X_test_poly)

In [None]:
# Evaluate the model using mean squared error and R-squared
mse_poly = mean_squared_error(y_test, y_pred_poly)
r2_poly = r2_score(y_test, y_pred_poly)

print("Mean squared error:", mse_poly)
print("R-squared value:", r2_poly)

In [None]:
errors = y_test - y_pred_poly

# Create a kernel density plot of the errors
# (set the font scale and figure size before plotting so they take effect)
sns.set(font_scale=1.5)
plt.figure(figsize=(20, 4))
sns.kdeplot(errors)
plt.xlabel('Error')
plt.ylabel('Density')
plt.title('Distribution of Errors')
plt.show()

### Perform L1 Regularization

In [None]:
from sklearn.linear_model import Lasso

In [None]:
X = df_encoded.drop(['App', 'Installs'], axis=1) # predictor variables
y = df_encoded['Installs'] # target variable

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Create a Lasso regression model with a regularization strength of alpha=0.1
lasso = Lasso(alpha=0.1)
# Fit the model to the training data
lasso.fit(X_train, y_train)

In [None]:
# Predict the target variable for the testing data
y_pred = lasso.predict(X_test)

In [None]:
# Calculate the mean squared error and R-squared of the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Print both metrics
print("Mean Squared Error:", mse)
print("R Squared Value:", r2)

### Perform L2 Regularization

In [None]:
from sklearn.linear_model import Ridge

In [None]:
X = df_encoded.drop(['App', 'Installs'], axis=1) # predictor variables
y = df_encoded['Installs'] # target variable

In [None]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fit the model to the training data
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)

In [None]:
# Predict the target variable for the testing data
y_pred = ridge.predict(X_test)

In [None]:
# Calculate the mean squared error and R-squared of the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print both metrics
print("Mean Squared Error:", mse)
print("R Squared Value:", r2)

# 3. Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
X = df_encoded.drop(['App', 'Installs'], axis=1)
y = df_encoded['Installs']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
regressor = DecisionTreeRegressor(random_state=42)

regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean squared error:", mse)
print("R-squared value:", r2)

In [None]:
result = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred })

result.tail(10)

In [None]:
residuals = y_test - y_pred

plt.scatter(y_test, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel('y_test')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

### Data points closer to the horizontal line at zero correspond to more accurate predictions.

# 4. Random Forest Regression

In [None]:
X = df_encoded.drop(['Installs', 'App'], axis=1)
y = df_encoded['Installs']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

In [None]:
y_pred = rf.predict(X_test)

In [None]:
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
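
Question 8 asks which features matter most. A fitted random forest exposes `feature_importances_`, which gives one rough, impurity-based answer. The sketch below runs on synthetic data in which only `Reviews` drives the target, so the data and the names `toy_X`, `toy_y` and `forest` are assumptions for illustration and say nothing about the real results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in: only 'Reviews' influences the target
rng = np.random.default_rng(0)
n = 300
toy_X = pd.DataFrame({
    'Reviews': rng.normal(size=n),
    'Size': rng.normal(size=n),
    'Rating': rng.uniform(1, 5, n),
})
toy_y = 10 * toy_X['Reviews'] + rng.normal(0, 0.5, n)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(toy_X, toy_y)

# Importances sum to 1; larger values mean the feature was used for more informative splits
importances = pd.Series(forest.feature_importances_, index=toy_X.columns)
print(importances.sort_values(ascending=False))
```

With the notebook's model the equivalent would be `pd.Series(rf.feature_importances_, index=X.columns)`.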

In [None]:

result = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred })

result.tail()

In [None]:
residuals = y_test - y_pred

plt.scatter(y_test, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel('y_test')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()

### Data points closer to the horizontal line at zero correspond to more accurate predictions.
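
Questions 7 and 9 ask for a comparison across several evaluation parameters and for predictions on at least 10 data points. One way to organize this is to fit every model in a loop on the same split and collect MAE, RMSE and R-squared in a single table. The sketch below uses synthetic data, so the features, target and resulting numbers are assumptions and do not reflect the Play Store results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for X and y
rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.normal(size=(400, 3)), columns=['f1', 'f2', 'f3'])
toy_y = 3 * toy_X['f1'] - 2 * toy_X['f2'] + rng.normal(0, 0.3, 400)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

models = {
    'Linear': LinearRegression(),
    'Ridge': Ridge(alpha=0.1),
    'Decision tree': DecisionTreeRegressor(random_state=42),
    'Random forest': RandomForestRegressor(n_estimators=100, random_state=42),
}

rows = []
for name, estimator in models.items():
    pred = estimator.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        'model': name,
        'MAE': mean_absolute_error(y_te, pred),           # average absolute error
        'RMSE': np.sqrt(mean_squared_error(y_te, pred)),  # penalizes large errors more
        'R2': r2_score(y_te, pred),                       # share of variance explained
    })

summary = pd.DataFrame(rows).set_index('model').sort_values('RMSE')
print(summary)

# Predictions for 10 held-out data points from one of the fitted models
sample_pred = models['Random forest'].predict(X_te.head(10))
print(sample_pred)
```

MAE and RMSE are in the units of the target, so for install counts they are easier to interpret than MSE, while R-squared allows comparison across models on the same split.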