-------------------------------------------------------------------------------------------------------------------------------

## Mobile Phone Analysis

Flipkart is a well-known Indian e-commerce website that offers a wide range of products, from clothing and electronics to groceries and home appliances. In this notebook, we analyze and visualize the details of one product category sold on Flipkart: mobile phones. A cell phone, also known as a mobile phone or simply a phone, has become a necessity rather than a luxury, and nearly everyone owns one today. Which phone is the best fit for a given buyer? Many of us know what we want, so why not help those who don't? To answer this question we build two models:<br><br>
<u>Model 1:</u><br>
A basic model that predicts the Sales Price of each phone.<br><br>
<u>Model 2:</u><br>
We introduce an additional attribute that denotes which customer group each phone suits best, and then build a predictive model for it.<br>

For this we will observe the following three conditions:
<li>A customer with only basic needs won't buy a high-priced phone.</li>
<li>Young customers will prefer a phone with more memory.</li>
<li>A customer with high-end needs will want a faster phone with good memory.</li><br><br>

-------------------------------------------------------------------------------------------------------------------------------

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## Data Loading

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('/kaggle/input/MobilePhones.csv')

In [None]:
len(df)

In [None]:
df.info()

In [None]:
df.describe()

-------------------------------------------------------------------------------------------------------------------------------

## Data Manipulation

In [None]:
df.head()

In [None]:
df['MobileName'][0].split("(")[1].split(",")

In [None]:
df['Color'] = df['MobileName'].apply(lambda x : x.split(",")[0].split("(")[1] 
                                        if len(x.split(",")[0].split("(")) > 1 else 'No Color')

df.head()
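
The split-based parsing above can also be written with a regular expression. The following is a minimal sketch, not the notebook's method: the sample names and the helper `extract_color` are hypothetical, and it assumes the color is the first item inside the parentheses.

```python
import re

# Hypothetical product names in the "<model> (<color>, <storage>)" layout
names = [
    "Realme C11 (Rich Green, 32 GB)",
    "I Kall K6",  # no parenthesised part
]

# Capture the text between "(" and the first "," or ")"
pattern = re.compile(r"\(([^,)]+)")

def extract_color(name):
    """Return the color inside the parentheses, or 'No Color' if absent."""
    match = pattern.search(name)
    return match.group(1).strip() if match else "No Color"

colors = [extract_color(n) for n in names]
print(colors)
```

Unlike the chained `split` calls, the pattern also stops at a closing parenthesis, so a name such as "Phone (Black)" would yield "Black" rather than "Black)".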

In [None]:
df['Brand'] = df['MobileName'].apply(lambda x : x.split()[0])
df['Brand'] = df['Brand'].apply(lambda x : 'I Kall' if x == 'I' else x)
df.head()

In [None]:
df['MobileName'] = df['MobileName'].apply(lambda x : x.split("(")[0])
df.head()

In [None]:
df['Discount'] = df['ListPrice'] - df['SalesPrice']
df.head()

-------------------------------------------------------------------------------------------------------------------------------

## Data Visualization

In [None]:
print(df['Brand'].value_counts())

plt.figure(figsize=(10,5))
sns.countplot(x='Brand', data=df)

Many of the available products are manufactured by "Realme". Let us now see how many distinct products each brand manufactures and sells.

In [None]:
print(df['MobileName'].value_counts()[:20])

plt.figure(figsize=(20,5))
sns.countplot(x='MobileName', data=df)
plt.xticks(rotation=60)

We observe that "Realme" is the only brand that offers a variety of models rather than a single one: "Realme 5i, Realme 6, Realme C11, Realme Narzo 10A, Realme 6 Pro" and many more. The other brands have only variations within a single model.
Among the phones on sale, the "Realme Narzo 10A" has the most variations within a model.

In [None]:
print(df['Stars'].value_counts())

sns.histplot(df['Stars'], kde=True)

The stars given to the products lie between 3.0 and 4.6.

In [None]:
print("Phones with lowest stars")
print("\n".join(df[df['Stars']==3.0]['MobileName'].unique()))

print("\nPhones with highest stars")
print("\n".join(df[df['Stars']==4.6]['MobileName'].unique()))

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='Brand', y='Stars', data=df)

All brands except "I Kall" receive better star ratings. A likely reason is that the company is not well known to many customers, so it does not benefit from brand familiarity.

In [None]:
discount = df['Discount'].value_counts()[:5]
discount.plot(kind='bar',title='Top 5 Discount Rate (In Rupees)')

Some products do not offer any discount. Among those that do, ₹1000 is the most common discount.

In [None]:
print("*** RAM *** ")
print(df['RAM_GB'].value_counts())
print("\n*** ROM *** ")
print(df['ROM_GB'].value_counts())

Three rows have inconsistent RAM or ROM values (a RAM of 32 GB or a ROM of 4 GB), so we drop these three rows.

In [None]:
print(df[df['RAM_GB'] == 32].index)
print(df[df['ROM_GB'] == 4].index)

In [None]:
df.drop([115,118,80], inplace=True,axis=0)
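
Dropping by hard-coded index works only as long as the row order of the CSV never changes. As a hedged alternative, the same rows can be selected with a boolean mask. The toy frame below is illustrative and its values are not taken from the dataset.

```python
import pandas as pd

# Toy frame; values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "RAM_GB": [4, 32, 6, 3],
    "ROM_GB": [64, 4, 128, 32],
})

# Flag the rows the notebook treats as inconsistent
bad = (toy["RAM_GB"] == 32) | (toy["ROM_GB"] == 4)

# Keep everything else; the original index labels are preserved
clean = toy[~bad]
print(clean)
```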

In [None]:
print("*** RAM *** ")
print(df['RAM_GB'].value_counts())
print("\n*** ROM *** ")
print(df['ROM_GB'].value_counts())

plt.figure(figsize=(14,5))

plt.subplot(1,2,1)
plt.title("RAM Space in GB")
sns.countplot(x='RAM_GB', data=df)
plt.xlabel("GB")

plt.subplot(1,2,2)
plt.title("ROM Space in GB")
sns.countplot(x='ROM_GB', data=df)
plt.xlabel("GB")

The maximum amounts of RAM and ROM are 8 GB and 128 GB respectively. Many mobile phones have 4 GB of RAM and 64 GB of ROM.

In [None]:
print(df['Color'].value_counts()[:10])

popcol = df['Color'].value_counts()[:10]

plt.figure(figsize=(10,5))
popcol.plot(kind='bar')

Blue and White-colored phones are largely available.

In [None]:
plt.figure(figsize=(10,5))

plt.subplot(2,2,1)
plt.title("Ratings")
sns.boxplot(x='Ratings', data=df)

plt.subplot(2,2,2)
plt.title("Reviews")
sns.boxplot(x='Reviews', data=df)

plt.subplot(2,2,3)
plt.title("List Price")
sns.boxplot(x='ListPrice', data=df)

plt.subplot(2,2,4)
plt.title("Sales Price")
sns.boxplot(x='SalesPrice', data=df)

plt.tight_layout(pad=2.0)

Each of these attributes shows an outlier. It may be that a single product is the outlier in every attribute. We need to handle this, because outliers can distort a predictive model. We first remove outliers using the Reviews and Ratings attributes.

In [None]:
df = df[df['Reviews'] < 5500]
df = df[df['Ratings'] < 60000]
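
The filter above uses fixed cut-offs. As a hedged alternative (not the notebook's method), the 1.5 × IQR whisker rule that box plots use by default can flag outliers programmatically. The review counts below are made up for illustration.

```python
import numpy as np

# Illustrative review counts with one extreme value
reviews = np.array([120, 340, 560, 800, 950, 1100, 9000])

# Quartiles and interquartile range
q1, q3 = np.percentile(reviews, [25, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are treated as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = reviews[(reviews < lower) | (reviews > upper)]
print(outliers)
```

A rule like this adapts to the data, whereas fixed thresholds have to be re-read from the plots whenever the dataset changes.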

In [None]:
plt.figure(figsize=(10,5))

plt.subplot(2,2,1)
plt.title("Ratings")
sns.boxplot(x='Ratings', data=df)

plt.subplot(2,2,2)
plt.title("Reviews")
sns.boxplot(x='Reviews', data=df)

plt.subplot(2,2,3)
plt.title("List Price")
sns.boxplot(x='ListPrice', data=df)

plt.subplot(2,2,4)
plt.title("Sales Price")
sns.boxplot(x='SalesPrice', data=df)

plt.tight_layout(pad=2.0)

The plots show that our assumption that the outliers belong to a single product was wrong, so we need to handle the remaining outlier in List Price before we build our predictive model.

In [None]:
df = df[df['ListPrice'] < 30000]

In [None]:
plt.figure(figsize=(10,5))

plt.subplot(2,2,1)
plt.title("Ratings")
sns.boxplot(x='Ratings', data=df)

plt.subplot(2,2,2)
plt.title("Reviews")
sns.boxplot(x='Reviews', data=df)

plt.subplot(2,2,3)
plt.title("List Price")
sns.boxplot(x='ListPrice', data=df)

plt.subplot(2,2,4)
plt.title("Sales Price")
sns.boxplot(x='SalesPrice', data=df)

plt.tight_layout(pad=2.0)

Finally, the two outliers are handled and we can proceed with other analyses. Let us now see some relationships between various attributes.

In [None]:
plt.figure(figsize=(10,5))
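# Correlate numeric columns only; text columns such as MobileName cannot be correlated
sns.heatmap(df.select_dtypes(include=[np.number]).corr(), annot=True)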

We observe a high positive correlation between the size of the RAM and both the List Price and the Sales Price. There is also a positive correlation between Ratings and Reviews, and between the List Price and the Sales Price. In addition, Discount is negatively correlated with ROM_GB, Ratings, and Reviews. We will now look at one positively and one negatively related pair of attributes.
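
For reference, `DataFrame.corr` computes the Pearson coefficient by default. The sketch below works the formula out by hand on made-up numbers and compares it with NumPy's result. The arrays are illustrative and not taken from the dataset.

```python
import numpy as np

# Illustrative values only
ratings = np.array([100.0, 250.0, 400.0, 800.0])
reviews = np.array([10.0, 30.0, 35.0, 90.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = ratings - ratings.mean()
dy = reviews - reviews.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient from NumPy's correlation matrix
r_numpy = np.corrcoef(ratings, reviews)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```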

In [None]:
plt.figure(figsize=(10,5))

plt.suptitle("Correlation Between Attributes")

plt.subplot(1,2,1)
plt.title("Positive Correlation")
plt.scatter(df['Ratings'],df['Reviews'], marker='v')
plt.xlabel("Ratings")
plt.ylabel("Reviews")

plt.subplot(1,2,2)
plt.title("Negative Correlation")
plt.scatter(df['Discount'],df['Reviews'], marker='v')
plt.xlabel("Discount in Rupees")
plt.ylabel("Reviews")

plt.tight_layout(pad=3.5)

We observe a strong positive correlation between Ratings and Reviews: products with more ratings tend to have proportionally more reviews. The second plot shows a weak negative relation between Discount and Reviews, meaning that products with larger discounts tend to receive somewhat fewer reviews.

In [None]:
table = pd.pivot_table(df, index='Brand', values=['SalesPrice','Discount','Ratings'])
table

In [None]:
table.plot(kind='bar',figsize=(10,5))

I Kall offers the smallest average discount, whereas OPPO offers the largest. POCO has received far more ratings on average than the other brands, and its average Sales Price is also much higher than the rest.
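
The figures in the table are per-brand means, because `pd.pivot_table` aggregates with the mean unless `aggfunc` says otherwise. A minimal sketch on a toy frame (brand names and values are placeholders) shows that it matches a plain `groupby(...).mean()`.

```python
import numpy as np
import pandas as pd

# Toy frame; brands and values are placeholders
toy = pd.DataFrame({
    "Brand": ["A", "A", "B"],
    "SalesPrice": [10000, 12000, 8000],
    "Discount": [1000, 0, 500],
})

cols = ["Discount", "SalesPrice"]

# pivot_table with no aggfunc averages each value column per index group
by_pivot = pd.pivot_table(toy, index="Brand", values=cols)

# The equivalent explicit group-by
by_group = toy.groupby("Brand")[cols].mean()

same = np.allclose(by_pivot[cols].to_numpy(), by_group[cols].to_numpy())
print(by_pivot)
print(same)
```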

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x='Brand', data=df)

Combining the two graphs above gives several insights. Although Realme has a variety of products, its ratings are far lower than POCO's, even though POCO offers a limited number of models. A wide variety of models does not seem to help OPPO either.

-------------------------------------------------------------------------------------------------------------------------------

## Summary of Analysis

<li>Several cell phones manufactured by Realme are available. Realme offers not only a variety of models but also many color variations of each model.</li>
<li>The star rating of the model "I Kall K6" is very low, whereas the star ratings of the models "Realme C11", "Apple iPhone XR", and "Realme Narzo 10A" are high. This indicates that these three models are very popular among people and satisfy customers' needs.</li>
<li>During the analysis we found some data inconsistencies in the RAM and ROM values. Since the number of inconsistent values is very small, we chose to drop them by their row index.</li>
<li>Many phones have a basic configuration of 4 GB RAM and 64 GB ROM, where ROM refers to the internal storage capacity of a phone.</li>
<li>We observe that there are many variations in the colors Blue and White, probably because these colors are in high demand among customers.</li>
<li>We came across several outliers in the attributes Ratings, Reviews, and List Price. We eliminated them by imposing conditions on the data frame.</li>
<li>Several positive and negative relationships were found between attributes.</li>
<li>Lastly, although fewer POCO phones are available, the brand receives a large number of ratings and offers a moderate discount. On the other hand, OPPO and Realme, despite their variety of models, fail to receive equivalent ratings.</li>

 

-------------------------------------------------------------------------------------------------------------------------------

## Feature Engineering

In [None]:
df.head()

In [None]:
df.Brand.value_counts().index

In [None]:
df['Brand'] = df['Brand'].map({'Realme':0,'Vivo':1,'OPPO':2,'I Kall':3,'Redmi':4,
                               'Infinix':5,'POCO':7,'Motorola':8,'Tecno':9})   
df.head(10)
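
One caveat with a hand-written dictionary is that `Series.map` returns NaN for any value missing from the mapping. As a hedged sketch (the brand list is illustrative), `pd.factorize` assigns a code to every distinct value automatically.

```python
import pandas as pd

# Illustrative brand column
brands = pd.Series(["Realme", "Vivo", "Apple", "Realme"])

# A dict that misses a brand silently produces NaN for it
partial = brands.map({"Realme": 0, "Vivo": 1})
print(partial.isna().sum())

# factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(brands)
print(codes.tolist(), list(uniques))
```

Checking `df['Brand'].isna().sum()` after the mapping is a cheap way to confirm that no brand was left out.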

-------------------------------------------------------------------------------------------------------------------------------

## Data Modeling

-------------------------------------------------------------------------------------------------------------------------------

## <li> Model 1

We first build a model that predicts the Sales Price of each phone.
For data modeling we need to split our data into a training set and a test set. Once the split is done, we can fit various models and check the score of each on the test set (for these regressors, `score` returns the coefficient of determination R²). We select the model with the highest score.
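
For scikit-learn regressors, `score` returns the coefficient of determination R². The sketch below computes it by hand so the percentage reported later is easier to interpret. The price arrays are made up for illustration.

```python
import numpy as np

# Illustrative true and predicted prices
y_true = np.array([9000.0, 12000.0, 15000.0, 20000.0])
y_pred = np.array([9500.0, 11500.0, 15500.0, 19000.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(round(r2 * 100, 2), "%")
```

A value of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean.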

In [None]:
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
dfnumberic = df.select_dtypes(include=[np.number])
dfnumberic.head()

In [None]:
print("Shape of the numeric data frame")
print(dfnumberic.shape)

In [None]:
X = dfnumberic.drop('SalesPrice',axis=1)
y = dfnumberic['SalesPrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

## 1. Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lrm = LinearRegression()
lrm.fit(X_train,y_train)

In [None]:
predictionslrm = lrm.predict(X_test)

In [None]:
scorelrm = round((lrm.score(X_test, y_test)*100),2)
print ("Model Score:",scorelrm,"%")

## 2. Ridge Regression

In [None]:
from sklearn.linear_model import Ridge

In [None]:
rrm = Ridge(alpha=100)
rrm.fit(X_train,y_train)

In [None]:
predictionrrm = rrm.predict(X_test)

In [None]:
scorerrm = round((rrm.score(X_test, y_test)*100),2)
print ("Model Score:",scorerrm,"%")

In [None]:
data = [['Linear Regression',scorelrm],['Ridge Regression',scorerrm]]
final = pd.DataFrame(data,columns=['Algorithm','R2 Score (%)'],index=[1,2])
final

Both models predict the price of the phone perfectly, which would not be possible in reality. The perfect score is caused by target leakage rather than genuine predictive power: the features include both ListPrice and Discount, and Discount was defined as ListPrice - SalesPrice, so SalesPrice is an exact linear function of the inputs.
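
With a perfect score, it is worth checking whether a feature encodes the target. Here Discount was defined as ListPrice - SalesPrice, so a linear model can recover SalesPrice exactly. A minimal sketch with synthetic prices (not the dataset) and a plain least-squares fit illustrates this:

```python
import numpy as np

# Synthetic prices, generated only for illustration
rng = np.random.default_rng(0)
list_price = rng.integers(5000, 30000, size=50).astype(float)
discount = rng.integers(0, 3000, size=50).astype(float)
sales_price = list_price - discount  # same definition as the Discount column

# Least-squares fit of SalesPrice on [ListPrice, Discount, intercept]
A = np.column_stack([list_price, discount, np.ones_like(list_price)])
coef, *_ = np.linalg.lstsq(A, sales_price, rcond=None)

# Coefficients come out close to 1 for ListPrice, -1 for Discount, 0 for the intercept
print(np.round(coef, 4))
```

Dropping ListPrice and Discount from `X` before fitting would give a more honest estimate of how well the remaining attributes predict the price.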

------------------------------------------------------------------------------------------------------------------------------

## <li> Model 2

In [None]:
df.head()

We now create an attribute that suggests which kind of customer each phone suits. For this, we use only three attributes from the data frame (df): RAM_GB, ROM_GB, and SalesPrice.

In [None]:
df['UserType'] = 'Teen'
high = df[(df['RAM_GB'] > 4) & (df['ROM_GB'] > 32)].index
low =  df[(df['SalesPrice'] < 12000) & (df['ROM_GB'] < 64)].index

In [None]:
# Label high-end phones first, then low-end phones that are not already 'High'
df.loc[high, 'UserType'] = 'High'
df.loc[low.difference(high), 'UserType'] = 'Low'
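
The labeling rule above can also be written in one vectorized step. This is a sketch on a toy frame with made-up values: `np.select` checks the conditions in order, so 'High' takes priority over 'Low' and everything else falls back to 'Teen'.

```python
import numpy as np
import pandas as pd

# Toy frame; values are illustrative
toy = pd.DataFrame({
    "RAM_GB": [8, 3, 4],
    "ROM_GB": [128, 32, 64],
    "SalesPrice": [18000, 7000, 13000],
})

# Same thresholds as the notebook's 'high' and 'low' selections
is_high = ((toy["RAM_GB"] > 4) & (toy["ROM_GB"] > 32)).to_numpy()
is_low = ((toy["SalesPrice"] < 12000) & (toy["ROM_GB"] < 64)).to_numpy()

# Conditions are checked in order, so 'High' wins over 'Low'
toy["UserType"] = np.select([is_high, is_low], ["High", "Low"], default="Teen")
print(toy["UserType"].tolist())
```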

In [None]:
df['UserType'].value_counts()

In [None]:
df.head()

In [None]:
df['UserType'] = df['UserType'].map({'High':0,'Teen':1,'Low':2})

In [None]:
dfnumberic = df.select_dtypes(include=[np.number]).drop('ListPrice', axis=1)
dfnumberic.head()

In [None]:
dfnumberic[['Ratings','Reviews','Stars','SalesPrice','Discount']].describe()

In [None]:
def ratings(num):
    # Bin the number of ratings into 10,000-wide buckets
    if num < 10000:
        return 1
    elif 10000 <= num < 20000:
        return 2
    elif 20000 <= num < 30000:
        return 3
    elif 30000 <= num < 40000:
        return 4
    elif 40000 <= num < 50000:
        return 5
    else:
        return 6


def reviews(num):
    # Bin the number of reviews into 1,000-wide buckets
    if num < 1000:
        return 1
    elif 1000 <= num < 2000:
        return 2
    elif 2000 <= num < 3000:
        return 3
    elif 3000 <= num < 4000:
        return 4
    else:
        return 5


def salesprice(num):
    # Bin the sales price into 5,000-rupee buckets
    if num < 5000:
        return 1
    elif 5000 <= num < 10000:
        return 2
    elif 10000 <= num < 15000:
        return 3
    elif 15000 <= num < 20000:
        return 4
    else:
        return 5


def stars(num):
    # Bin the star rating into half-star buckets
    if num < 3.0:
        return 1
    elif 3.0 <= num < 3.5:
        return 2
    elif 3.5 <= num < 4.0:
        return 3
    elif 4.0 <= num < 4.5:
        return 4
    else:
        return 5


def discount(num):
    # 0 means no discount; the other buckets are 1,200 rupees wide
    if num == 0:
        return 0
    elif num < 1200:
        return 1
    elif 1200 <= num < 2400:
        return 2
    elif 2400 <= num < 3600:
        return 3
    elif 3600 <= num < 4800:
        return 4
    else:
        return 5
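
Threshold binning of this kind can be expressed more compactly with the standard-library `bisect` module. The sketch below reproduces the rating-count bins. The helper `bin_by_edges` and the constant `RATING_EDGES` are hypothetical names introduced for illustration.

```python
from bisect import bisect_right

# Lower bounds of bins 2..6 for the rating counts
RATING_EDGES = [10000, 20000, 30000, 40000, 50000]

def bin_by_edges(num, edges):
    """Return a 1-based bin label: the number of edges <= num, plus one."""
    return bisect_right(edges, num) + 1

for n in (9999, 10000, 25000, 60000):
    print(n, bin_by_edges(n, RATING_EDGES))
```

Because the edges live in a list, the same helper can serve the other columns by passing a different list of edges.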

In [None]:
dfnumberic['Ratings'] = dfnumberic['Ratings'].apply(ratings)
dfnumberic['Reviews'] = dfnumberic['Reviews'].apply(reviews)
dfnumberic['Stars'] = dfnumberic['Stars'].apply(stars)
dfnumberic['SalesPrice'] = dfnumberic['SalesPrice'].apply(salesprice)
dfnumberic['Discount'] = dfnumberic['Discount'].apply(discount)

In [None]:
dfnumberic.head()

In [None]:
X = dfnumberic.drop(['UserType'],axis=1)
y = dfnumberic['UserType']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

## 1. Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rmodel = RandomForestClassifier(n_estimators=100)
rmodel.fit(X_train,y_train)

In [None]:
rprediction = rmodel.predict(X_test)
print("Confusion Matrix")
print(confusion_matrix(y_test,rprediction))

rscore = round((rmodel.score(X_test, y_test)*100),2)
print ("\nModel Score:",rscore,"%")
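
For a classifier, `score` is the mean accuracy, which equals the diagonal of the confusion matrix divided by its total. A small sketch with made-up labels, using `pd.crosstab` as a stand-in for `confusion_matrix` (rows are actual classes, columns are predicted classes), shows the relationship:

```python
import numpy as np
import pandas as pd

# Illustrative labels using the notebook's encoding: 0=High, 1=Teen, 2=Low
y_true = pd.Series([0, 1, 2, 1, 2, 0], name="actual")
y_pred = pd.Series([0, 1, 2, 2, 2, 0], name="predicted")

# Rows are actual classes, columns are predicted classes
cm = pd.crosstab(y_true, y_pred)

# Correct predictions sit on the diagonal (valid here because every
# class appears in both series, so the table is square and aligned)
accuracy = np.trace(cm.to_numpy()) / cm.to_numpy().sum()
print(cm)
print(round(accuracy * 100, 2), "%")
```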