# Prediction Challenge Part 2: Report

## How likely is one to download your app?

In [1]:
import pandas as pd
import numpy as np
import re
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

### I) Cleaning and Inspecting the Dataset

In [2]:
app = pd.read_csv('googleplaystore.csv')
app.isna().sum()

FileNotFoundError: [Errno 2] No such file or directory: 'googleplaystore.csv'

Getting rid of NaN values

In [3]:
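# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
app['Rating'] = app['Rating'].fillna(0)
app['Type'] = app['Type'].fillna('Unknown')
app['Content Rating'] = app['Content Rating'].fillna('Unknown')
app['Android Ver'] = app['Android Ver'].fillna('Unknown')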
app.isna().sum()

NameError: name 'app' is not defined

In [None]:
app.head()

#### Rating: Taking a look at the Rating column:

In [None]:
print(app.Rating.dtype)
print(app.Rating.unique())

#### Reviews: Converting Reviews to integer

In [None]:
print(app.Reviews.dtype)
print(app.Reviews.unique())
print(app.Reviews.describe())

In [None]:
app['Reviews'] = app['Reviews'].apply(lambda x: str(x))

In [None]:
def converting_m(value):
    if 'M' in value:
        return float(value.replace('M', '')) * 1e6
    return value

Here M stands for million, so the M suffix is converted into the corresponding number.

In [None]:
app['Reviews'] = app['Reviews'].apply(converting_m)

In [None]:
app['Reviews'] = app['Reviews'].apply(lambda x: int(x))
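As a quick sanity check of the conversion above, here is a minimal sketch on a few made-up strings. The sample values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

def converting_m(value):
    # A trailing 'M' means millions; any other string is returned unchanged
    if 'M' in value:
        return float(value.replace('M', '')) * 1e6
    return value

# Hypothetical sample values, not taken from the dataset
sample = pd.Series(['159', '3.0M', '87510'])
converted = sample.apply(converting_m).apply(lambda x: int(x))
print(converted.tolist())
```

Every entry ends up as a plain integer, which is what the later `describe()` and `skew()` calls need.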

#### Size: Converting the letters in Size column:

As above, the letters are removed from the Size values: k denotes kilobytes, which are converted to megabytes, and M denotes megabytes, which are kept as they are. We also deal with 'Varies with device', which is replaced with NaN.

In [None]:
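def converting_size(size):
    if 'k' in size:
        # kilobytes -> megabytes
        return float(size.replace('k', '')) / 1024
    elif 'M' in size:
        # already in megabytes
        return float(size.replace('M', ''))
    elif 'Varies with device' in size:
        return np.nan
    # Any other unexpected value is treated as missing
    return np.nan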
app['Size'] = app['Size'].apply(converting_size)
app.rename(columns={'Size' : 'Size(Mb)'}, inplace=True)
print(app['Size(Mb)'].dtype)
print(app['Size(Mb)'].unique())
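The size conversion described above can be checked on a toy series. This is only a sketch: `size_to_mb` is a hypothetical helper name and the three sample strings are made up.

```python
import numpy as np
import pandas as pd

def size_to_mb(size):
    # Hypothetical helper mirroring the conversion described above
    if 'k' in size:
        return float(size.replace('k', '')) / 1024   # kilobytes -> megabytes
    if 'M' in size:
        return float(size.replace('M', ''))          # already in megabytes
    return np.nan                                    # e.g. 'Varies with device'

# Made-up sample values
sample = pd.Series(['512k', '19M', 'Varies with device'])
sizes = sample.apply(size_to_mb)
print(sizes.tolist())
```

Returning NaN for the non-numeric case keeps the column numeric, so the missing sizes can be imputed later together with the other NaNs.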

#### Installs: Converting the Installs column to numbers only:

In [None]:
app['Installs'] = app['Installs'].apply(lambda x: x.replace('+' , '') if '+' in str(x) else x)
app['Installs'] = app['Installs'].apply(lambda x: x.replace(',' , '') if ',' in str(x) else x)
app['Installs'] = app['Installs'].apply(lambda x: np.nan if 'Free' in str(x) else x)

app['Installs'] = pd.to_numeric(app['Installs'], errors='coerce')

print(app['Installs'].dtype)
print(app['Installs'].unique())
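The same clean-up can be sketched with vectorized string methods. The sample strings below are assumptions chosen to resemble the values handled above, including a stray 'Free' entry.

```python
import pandas as pd

# Made-up sample values resembling the raw Installs strings
raw = pd.Series(['10,000+', '5,000,000+', 'Free'])

# Strip the '+' and ',' characters literally (regex=False), then coerce to numbers
cleaned = raw.str.replace('+', '', regex=False).str.replace(',', '', regex=False)
installs = pd.to_numeric(cleaned, errors='coerce')   # non-numeric strings become NaN
print(installs.tolist())
```

With `errors='coerce'`, anything that cannot be parsed becomes NaN instead of raising an error.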

#### Type: Checking Type Column

In [None]:
app['Type'] = app['Type'].apply(lambda x: np.nan if '0' in str(x) else x)
print(app.Type.dtype)
print(app.Type.unique())

#### Price: converting them to float while removing/converting unnecessary values

In [None]:
app.Price.unique()

In [None]:
def converting_Everyone(val):
    if 'Everyone' in val:
        return float(val.replace('Everyone', '0'))
    return val


In [None]:
app['Price'] = app['Price'].apply(converting_Everyone)


In [None]:
app['Price'] = app['Price'].apply(lambda x: x.replace('$' , '') if '$' in str(x) else x)
app['Price'] = app['Price'].apply(lambda x: float(x))
app.rename(columns={'Price' : 'Price($)'}, inplace=True)


In [None]:
app['Price($)'].unique()

#### Content Rating:

In [None]:
print(app['Content Rating'].dtype)
print(app['Content Rating'].unique())
print(app['Content Rating'].describe())

In [None]:
app['Content Rating'] = app['Content Rating'].apply(lambda x: x.replace('Mature 17+' , 'Adults only 18+') if 'Mature 17+' in str(x) else x)
print(app['Content Rating'].dtype)
print(app['Content Rating'].unique())
print(app['Content Rating'].value_counts())

#### Last Updated Column:

In [None]:
print(app['Last Updated'].dtype)
print(app['Last Updated'].unique())

#### Genre:

In [None]:
print(app['Genres'].value_counts())

#### Category:

In [None]:
print(app['Category'].value_counts())

Since this dataset was created five years ago and is not updated consistently, the Last Updated dates do not contribute to my pattern building.

#### Getting the head again

In [None]:
app.head()

### II) Feature Engineering: Adding more columns based on other columns

#### 1. Applying log transformation to reduce the skewness of Reviews:

In [None]:
#Taking a look at reviews:
plt.figure(figsize=(10, 6))
plt.hist(app['Reviews'], bins=50)
plt.title('Distribution of Reviews')
plt.xlabel('Reviews')
plt.ylabel('Frequency')
plt.show()
print(app['Reviews'].skew())

The skewness of Reviews is 16.4, which is very high, so we apply a log transformation to reduce it. This avoids bias from a few extreme values and should help model accuracy.

In [None]:
# Same histogram of reviews, now with a log-scaled frequency axis:
plt.figure(figsize=(10, 6))
plt.hist(app['Reviews'], bins=50, log = True)
plt.title('Distribution of Reviews')
plt.xlabel('Reviews')
plt.ylabel('Frequency (log scale)')
plt.show()

In [None]:
app['Log_Reviews'] = np.log(app['Reviews'] + 1)
app.head()
print(app['Log_Reviews'].skew())

The skewness is now close to 0, as we can see above, so after the log transformation the distribution is much closer to symmetric. Therefore we can use it for modeling.
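The effect of the log transformation on skewness can be reproduced on synthetic data. This is a sketch under the assumption that a log-normal sample is a fair stand-in for a heavy-tailed count column. The parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic heavy-tailed counts standing in for a Reviews-like column (assumption)
counts = pd.Series(np.floor(rng.lognormal(mean=6, sigma=2.5, size=5000)))

raw_skew = counts.skew()
log_skew = np.log1p(counts).skew()   # log1p(x) = log(x + 1), safe for zero counts
print(raw_skew, log_skew)
```

`np.log1p` computes the same quantity as `np.log(x + 1)` used above and handles zero counts without producing negative infinity.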

#### 2. Combining Rating and Reviews, as a high combined value could mean that the app is more likely to be downloaded.

In [None]:
app['Rating_Review_Interaction'] = app['Rating'] * app['Log_Reviews']
app.head()

In [None]:
print(app['Rating_Review_Interaction'].skew())

From the above skewness we can see that the Rating_Review_Interaction is fairly symmetrical.

A high Rating_Review_Interaction value could mean that people loved the app so much that they took the time to rate and review it, or that people disliked the app and took the time to rate and review it negatively. The former is more likely, as a good rating generally indicates a good app, and combined with a high number of reviews it would perhaps also indicate a reasonably trending app.

#### 3. Taking a look at Installs to understand how to categorize them into groups:

Since the number of installs directly affects how likely an app is to be downloaded, it is important to include it in the pattern in a nuanced way.

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(app['Installs'], bins=50, log=True)
plt.title('Distribution of Installs')
plt.xlabel('Installs')
plt.ylabel('Frequency (log scale)')
plt.show()

In [None]:
# Displaying basic statistics
print(app['Installs'].describe())
print(app['Installs'].skew())

Because of the high skewness, we take the 75th percentile as the threshold. This focuses on the top 25% of apps by installs, the popular apps, and gives a meaningful distinction between a high and a low likelihood of download.

People are more attracted to an app when its number of downloads is high. Given how skewed Installs is, taking only the top 25% lets me focus on the subset of apps that is most relevant to the characteristics of a highly likely download.

To make the pattern more complex, we add another condition: Rating >= 4.5.

In [None]:
threshold = 5_000_000  # 5 million installs
app['High Number of Installs(more than 5M and rating>=4.5)'] = ((app['Installs'] >= threshold) & (app['Rating'] >= 4.5)).astype(int)
app.head()
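The percentile-based flag can be illustrated on a small toy frame. The values and the column name `popular_and_well_rated` are hypothetical. Here the cutoff is computed with `quantile(0.75)` rather than hard-coded as in the cell above.

```python
import pandas as pd

# Hypothetical toy data, not from the dataset
toy = pd.DataFrame({
    'Installs': [100, 1_000, 10_000, 100_000, 1_000_000, 5_000_000, 10_000_000, 50_000_000],
    'Rating':   [4.8, 3.9, 4.1, 4.6, 4.4, 4.5, 4.7, 4.2],
})

cutoff = toy['Installs'].quantile(0.75)   # value at the 75th percentile
popular = (toy['Installs'] >= cutoff) & (toy['Rating'] >= 4.5)
toy['popular_and_well_rated'] = popular.astype(int)
print(cutoff)
print(toy)
```

Only rows that pass both conditions get a 1, so a very popular app with a mediocre rating is still flagged 0.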

#### 4. Using Size in our model:

Another very relevant feature is Size. With good rating, reviews, it is important that the app is of reasonable size. Therefore, including size in the pattern is significant.

In [None]:
# Displaying basic statistics
print(app['Size(Mb)'].describe())
print(app['Size(Mb)'].skew())

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(app['Size(Mb)'], bins=50)
plt.title('Distribution of Size(Mb)')
plt.xlabel('Size(Mb)')
plt.ylabel('Frequency')
plt.show()

As we can see, Size is skewed as well, so we also plot it with a log-scaled frequency axis:

In [None]:
plt.figure(figsize=(10, 6))
plt.hist(app['Size(Mb)'], bins=50, log=True)
plt.title('Distribution of Size(Mb)')
plt.xlabel('Size(Mb)')
plt.ylabel('Frequency (log scale)')
plt.show()

This is similar to what we did for Installs, but here we keep the values at or below the 75th percentile. Hence, we include apps smaller than 30 MB, and to make the pattern more complex we add another condition as above: Rating >= 3, which is a reasonable rating.

In [None]:
threshold = 30 
app['Size(<=30andrating>3)'] = ((app['Size(Mb)'] < threshold) & (app['Rating'] >= 3)).astype(int)
app.head()

#### 5. Getting Composite Score!

To create a more impactful pattern, a composite score may come in handy, so I created one based on four important columns:

In [None]:
# Weights for each component
RatingWeight = 1
RatingReviewInteractionWeight = 2
LogSizeWeight = 2  # applied to the size flag, not to a log-transformed size
priceweight = 0.5  # defined but not used in the score below
installsweight = 1.5
# Calculate the composite score (note that the size-flag term is subtracted)
app['Composite_Score'] = ((RatingWeight * app['Rating']) +
    (RatingReviewInteractionWeight * app['Rating_Review_Interaction']) -
    (LogSizeWeight * app['Size(<=30andrating>3)']) + (installsweight * app['High Number of Installs(more than 5M and rating>=4.5)']))

app.head()

We want a high rating, so Rating gets a positive weight. We also want a high Rating_Review_Interaction value, so it is added to the composite score with a larger weight. Size matters because people with little storage may struggle with large apps, so the size flag enters the score with its own weight (it is subtracted in the formula above). Finally, a high number of installs is rewarded through the installs flag.
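To make the weighting concrete, here is a worked example for a single hypothetical app. The rating, review count and flag values are made up. The weights and signs follow the composite score formula above.

```python
import numpy as np

# Hypothetical app, values chosen only for illustration
rating, reviews = 4.5, 1000
small_and_rated, popular_and_rated = 1, 1   # the two 0/1 flags built earlier

interaction = rating * np.log(reviews + 1)  # Rating * Log_Reviews
score = (1 * rating
         + 2 * interaction
         - 2 * small_and_rated
         + 1.5 * popular_and_rated)
print(round(score, 2))
```

The interaction term dominates the result, so the score is driven mostly by the rating and the number of reviews, while the two flags shift it by at most a couple of points.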

In [None]:
print(app['Composite_Score'].describe())
print(app['Composite_Score'].skew())

### III) Dealing with NaNs and missing values:

In [None]:
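for column in app.columns:
    if app[column].dtype in ['int64', 'float64']:
        # Numeric columns: fill missing values with the median
        median = app[column].median()
        app[column] = app[column].fillna(median)
    else:
        # Non-numeric columns: fill missing values with the most frequent value
        mode = app[column].mode()[0]
        app[column] = app[column].fillna(mode)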
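The median/mode imputation can be seen on a toy frame. This is a sketch with made-up values. It uses `pd.api.types.is_numeric_dtype` as one possible way to separate numeric from non-numeric columns.

```python
import numpy as np
import pandas as pd

# Made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    'Rating': [4.0, np.nan, 5.0, 3.0],
    'Type': ['Free', 'Free', None, 'Paid'],
})

for column in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[column]):
        toy[column] = toy[column].fillna(toy[column].median())   # median of 4, 5, 3
    else:
        toy[column] = toy[column].fillna(toy[column].mode()[0])  # most frequent value
print(toy)
```

The median is less sensitive to extreme values than the mean, which matters for columns as skewed as the ones in this dataset.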

### IV) Creating the Target Column:

In [None]:
rating_review_interaction = app['Rating_Review_Interaction'].quantile(0.75)  # high rating-review interaction
median_rating_review_interaction = app['Rating_Review_Interaction'].quantile(0.5)  # above-median rating-review interaction
low_rating_review_interaction = app['Rating_Review_Interaction'].quantile(0.25)  # low rating-review interaction
# Size is handled through the Size(<=30andrating>3) flag, under the assumption that very large apps
# might be less likely to be downloaded due to storage and data constraints.
def Likely_to_Download(row):
    if (row['Type'] == 'Free' and
        row['Rating_Review_Interaction'] > rating_review_interaction and
        row['Category'] in ['FAMILY','GAME', 'BUSINESS', 'MEDICAL', 'TOOLS'] and 
        row['Content Rating'] == 'Everyone' and
        row['Composite_Score'] > 100 and
        row['Size(<=30andrating>3)'] == 1 and
        row['High Number of Installs(more than 5M and rating>=4.5)'] == 1):
        return 'Highly Likely'
    elif (row['Type'] == 'Free' and
          row['Rating_Review_Interaction'] > median_rating_review_interaction and
          row['Size(<=30andrating>3)'] == 0 and
          row['High Number of Installs(more than 5M and rating>=4.5)'] == 0):
        return 'Likely'
    elif (row['Rating_Review_Interaction'] > low_rating_review_interaction):
        return 'Not Likely'
    else:
        return 'Disregard'
app['Likely to be Downloaded'] = app.apply(Likely_to_Download, axis=1)
app.head()

In [None]:
file_path = '/Users/anushaparanjpe/Downloads/PredictionChallenge/modified_googleapp_dataset.csv'
app.to_csv(file_path, index=False)

For the Highly Likely label:

1. The app should be free.
2. The Rating_Review_Interaction should be high.
3. Category: 'FAMILY', 'GAME', 'BUSINESS', 'MEDICAL' and 'TOOLS' are some of the top categories, so apps from these categories are more likely to be downloaded.
4. Content Rating: 'Everyone' has the highest count and is most likely to be part of the highly likely downloads.
5. Composite Score > 100: the 75th percentile is 97.044412, so this is a reasonable cutoff.
6. row['Size(<=30andrating>3)'] == 1 means the app is smaller than 30 MB and has a rating of at least 3.
7. row['High Number of Installs(more than 5M and rating>=4.5)'] == 1 is included because more people are drawn to an app when it has a higher number of installs.

The Likely label was built in a similar fashion, and the remaining rows are labelled Not Likely or Disregard.
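An if/elif chain like the one above can also be written in vectorized form with `np.select`, which picks the first matching condition for each row. The sketch below uses a simplified, hypothetical rule set with two conditions and made-up cutoffs. It is not the full pattern described above.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'Interaction' stands in for Rating_Review_Interaction
toy = pd.DataFrame({
    'Type': ['Free', 'Free', 'Paid', 'Free'],
    'Interaction': [60.0, 35.0, 40.0, 5.0],
})
high, low = 50.0, 10.0   # hypothetical cutoffs standing in for the quantiles

conditions = [
    (toy['Type'] == 'Free') & (toy['Interaction'] > high),
    toy['Interaction'] > low,
]
labels = ['Highly Likely', 'Not Likely']
# The first condition that is True wins, like an if/elif chain; otherwise the default is used
toy['Label'] = np.select(conditions, labels, default='Disregard')
print(toy)
```

For a dataset of this size the row-wise `apply` is fine. The vectorized form mainly makes the order of precedence between rules explicit.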

#### Taking a look at all the highly likely rows

In [None]:
app.groupby('Likely to be Downloaded').get_group('Highly Likely')

---------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------

### Hypothetical Situation:

##### Coming to the hypothetical situation where a student tries to build a model using this dataset to predict the response variable: Likely to be Downloaded

In [None]:
#Using the dataset after cleaning:
student_data = app.drop(['Log_Reviews','Rating_Review_Interaction','High Number of Installs(more than 5M and rating>=4.5)','Size(<=30andrating>3)','Composite_Score'], axis = 1)
print(student_data.isna().sum())

Looking at the missing values, the student would first think to deal with them in order to create a successful model:

In [None]:
student_data.head()

Looking at the columns, the student may reasonably think that columns like Unnamed, App and Last Updated are not relevant, as the dataset is 5 years old.

#### Now to inspect the columns and think about which columns would impact the response column:

1. The student may start with visualizations, as they help the most in understanding the dataset. The columns most relevant to the likelihood of downloading an app would be the Reviews, the Ratings, the number of Installs and perhaps the Price.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")

# Getting the histogram of each of the columns mentioned above, plus size: 'Rating', 'Reviews', 'Price($)', 'Installs', 'Size(Mb)'
fig, ax = plt.subplots(1, 5, figsize=(18, 6))
sns.histplot(student_data['Rating'].dropna(), bins=20, kde=True, ax=ax[0])
ax[0].set_title('Ratings Distribution')

sns.histplot(student_data['Reviews'], bins=20, kde=True, ax=ax[1])
ax[1].set_title('Reviews Distribution')

sns.histplot(student_data['Price($)'].dropna(), bins=20, kde=True, ax=ax[2])
ax[2].set_title('Price Distribution')

sns.histplot(student_data['Installs'].dropna(), bins=20, kde=True, ax=ax[3])
ax[3].set_title('Installs Distribution')

sns.histplot(student_data['Size(Mb)'].dropna(), bins=20, kde=True, ax=ax[4])
ax[4].set_title('Size(Mb) Distribution')

plt.tight_layout()
plt.show()

#### From the above plots, the student may notice that the columns Reviews, Price and Installs are highly skewed. Hence, the student may check these columns individually.

i) Reviews:

In [None]:
print(student_data['Reviews'].describe())
print(student_data['Reviews'].skew())

In [None]:
student_data['Log_Reviews'] = np.log(student_data['Reviews'] + 1)
print(student_data['Log_Reviews'].skew())
student_data.head()

ii) Price:

In [None]:
print(student_data['Price($)'].describe())
print(student_data['Price($)'].skew())

The above value of skewness is very high and shows the following:

1. The 75th percentile is 0, so at least 75% of the observations are 0, meaning that a large part of the dataset consists of zeros.

2. The maximum value is 400, highlighting the presence of extreme values or outliers that affect the mean and the standard deviation.

Hence using the Price column does not sound reasonable.
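The reasoning about zeros and outliers can be checked directly. This is a sketch on a made-up price list, assumed to mimic a column that is mostly free apps with one extreme value.

```python
import pandas as pd

# Made-up prices: mostly zeros plus one extreme value
prices = pd.Series([0.0] * 8 + [4.99, 399.99])

share_free = (prices == 0).mean()   # fraction of rows with price 0
q75 = prices.quantile(0.75)         # 75th percentile
price_skew = prices.skew()
print(share_free, q75, price_skew)
```

When the 75th percentile is 0, the column carries little information beyond free versus paid, which is why a binary indicator is a natural replacement.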

Therefore, instead of Price the student may use the Type column for modeling:

In [None]:
student_data['Free_apps'] = student_data['Type'].apply(lambda x: 1 if x == 'Free' else 0)
student_data.head()

iii) Installs:

In [None]:
print(student_data['Installs'].describe())
print(student_data['Installs'].skew())

Through this it can be seen that the 75th percentile of the Installs column is 5M, i.e. 75 percent of the apps have 5M installs or fewer. With this information alone, it may not occur to the student to create a column that isolates the top 25 percent, the popular apps, although the value at the 75% mark may hint at it.

The student may set a general threshold:

In [None]:
student_data['High Number of Installs'] = (student_data['Installs'] > 1e6).astype(int)
student_data.head()

iv) Size:

In [None]:
print(student_data['Size(Mb)'].describe())
print(student_data['Size(Mb)'].skew())

The student may apply log transformation to Size:

In [None]:
student_data['Log_Size(Mb)'] = np.log(student_data['Size(Mb)'] + 1)
print(student_data['Log_Size(Mb)'].skew())
student_data.head()

Now the skewness is very low, so it can be used for modeling.

#### With this current information the student may start creating models:

Getting the data ready to create models:

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Features and target variable
# Dropping the target variable and using one-hot encoding for the remaining text columns
X = pd.get_dummies(student_data.drop(columns=['Likely to be Downloaded', 'App', 'Category',  'Content Rating', 'Genres', 'Last Updated']),drop_first=True)
y = student_data['Likely to be Downloaded']
# Encode the string labels as integers (the encoder must be created before use)
le = LabelEncoder()
y = le.fit_transform(y)
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardizing the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape, X_test_scaled.shape, y_train.shape, y_test.shape
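The split-then-scale order used above matters: the scaler should learn its mean and standard deviation from the training split only. The sketch below shows this on synthetic data. The array shapes and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=10, scale=3, size=(200, 2))   # synthetic features
y_demo = rng.integers(0, 2, size=200)                 # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

scaler_demo = StandardScaler().fit(X_tr)   # statistics come from the training split only
X_tr_s = scaler_demo.transform(X_tr)
X_te_s = scaler_demo.transform(X_te)       # test split reuses the training statistics
print(X_tr_s.mean(axis=0), X_te_s.mean(axis=0))
```

The scaled training features have mean 0 and unit variance by construction, while the test features are only approximately standardized. That is expected and avoids leaking test information into preprocessing.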

#### Model 1: Using Logistic Regression: Baseline

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred = log_reg.predict(X_test_scaled)

# Evaluating the model
log_reg_accuracy = accuracy_score(y_test, y_pred)
log_reg_f1 = f1_score(y_test, y_pred, average = 'macro')

log_reg_accuracy, log_reg_f1
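Reporting macro F1 next to accuracy is useful when the classes are imbalanced, as they may well be for a rule-based target like this one. This is a minimal sketch with made-up labels, where a model that always predicts the majority class still reaches high accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: nine samples of class 0, one sample of class 1
y_true_demo = [0] * 9 + [1]
y_pred_demo = [0] * 10   # a model that always predicts the majority class

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
# Macro F1 averages the per-class F1 scores, so the missed class counts fully
macro_f1_demo = f1_score(y_true_demo, y_pred_demo, average='macro', zero_division=0)
print(acc_demo, macro_f1_demo)
```

Accuracy looks strong here while macro F1 is below 0.5, because the minority class is never predicted.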

#### Model 2: Using Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)
#X_train_filled = np.nan_to_num(X_train_scaled, nan=np.nanmedian(X_train_scaled, axis=0))
#X_test_filled = np.nan_to_num(X_test_scaled, nan=np.nanmedian(X_test_scaled, axis=0))

# Predicting on the test set
y_pred_rf = rf.predict(X_test_scaled)

# Evaluating the model
rf_accuracy = accuracy_score(y_test, y_pred_rf)
rf_f1 = f1_score(y_test, y_pred_rf, average = 'macro')

rf_accuracy, rf_f1

#### Therefore, if the student is able to identify the skewed columns and apply log transformations, it is possible to get a good prediction using a random forest model.