# Windows Store : EDA
## Dataset of the apps in the Microsoft Windows Store
<img src="https://cdn.windowsreport.com/wp-content/uploads/2019/04/Windows-Store-needs-to-be-online.png" width=200><br>
In this notebook I explore the Windows Store data to gain insight into the most popular apps and to find out which apps the general public likes and rates most heavily.

<ul>
    <h2>Table of Contents</h2>
    <li><a href="#1">Data Summary</a></li>
    <li><a href="#2">Highest Rated Apps</a></li>
    <li><a href="#3">Apps with Lowest Rating</a></li>
    <li><a href="#4">Ratings Distribution</a></li>
    <li><a href="#5">Apps With Most Ratings</a></li>
    <li><a href="#6">Most Popular Category by No of Apps</a></li>
    <li><a href="#7">Most Popular Category by Rating</a></li>
    <li><a href="#8">Most Popular Category by Average No of Ratings</a></li>
    <li><a href="#15">Distribution of No of Ratings Across Ratings</a></li>
    <li><a href="#9">Temporal Distributions</a></li>
    <li><a href="#10">Increase in Number of Apps over the years</a></li>
    <li><a href="#11">Free vs Paid apps</a></li>
    <li><a href="#12">When did Paid Apps become a thing?</a></li>
    <li><a href="#13">Most Expensive Apps?</a></li>
    <li><a href="#14">Best Paid App?</a></li>
</ul>

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from collections import defaultdict
import plotly

plt.rcParams['figure.figsize'] = 8, 5
plt.style.use("fivethirtyeight")
pd.options.plotting.backend = "plotly"

In [None]:
data = pd.read_csv('../input/windows-store/msft.csv')
data.head()

## <a id="1">Data Summary</a>

In [None]:
data.shape

In [None]:
data.info()

In [None]:
def NullUnique(df):
    dic = defaultdict(list)
    for col in df.columns:
        dic['Feature'].append(col)
        dic['NumUnique'].append(len(df[col].unique()))
        dic['NumNull'].append(df[col].isnull().sum())
        dic['%Null'].append(round(df[col].isnull().sum()/df.shape[0] * 100,2))
    return pd.DataFrame(dict(dic)).sort_values(['%Null'],ascending=False).style.background_gradient()

In [None]:
NullUnique(data)

**Observations**:
- Almost all of the data is present
- 5 of the 6 features have exactly one NaN value

In [None]:
data.iloc[-1]

All NaN values come from the same row, the last one in the dataset.
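
As a cross-check on the observation above, the rows that hold the missing values can be located programmatically instead of inspecting the last row by eye. The sketch below uses a small made-up frame (`toy`, with invented values), not the real dataset, and drops the affected rows by label without hard-coding an index.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "Name": ["App A", "App B", "stray text"],
    "Rating": [4.5, 3.0, np.nan],
    "Category": ["Music", "Books", np.nan],
})

# Number of missing values in each row.
nan_per_row = toy.isna().sum(axis=1)

# Labels of the rows that contain at least one missing value.
bad_rows = nan_per_row[nan_per_row > 0].index

# Drop those rows by label instead of typing the index by hand.
cleaned = toy.drop(index=bad_rows)
print(list(bad_rows), cleaned.shape)
```

If `bad_rows` contains a single label, every NaN in the frame sits in that one row.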

In [None]:
data.drop(5321, axis=0, inplace = True)

### Data Content
* Name: Name of the app.
* Rating: Rating of the app.
* No of people Rated: Number of people who rated the app.
* Category: Category of the app.
* Date: Date when the app was posted.
* Price: Price of the app.

## <a id="2">Highest Rated Apps</a>

In [None]:
# Compute the top 10 once and reuse it for every argument.
top_rated = data.nlargest(n=10, columns="Rating")
fig = px.bar(x=top_rated["Name"],
             y=top_rated["Rating"],
             color=top_rated["Name"].values,)
# The x axis holds app names and the y axis holds ratings.
fig.update_xaxes(title="Names")
fig.update_yaxes(title="Ratings")
fig.update_layout(title= "Top rated apps", height = 600, width = 800, showlegend=False)
fig.show()

## <a id="3">Apps with Lowest Rating</a>

In [None]:
# Compute the bottom 10 once; colour by the same apps that are plotted.
lowest_rated = data.nsmallest(n=10, columns="Rating")
fig = px.bar(x=lowest_rated["Name"],
             y=lowest_rated["Rating"],
             color=lowest_rated["Name"].values,)
# The x axis holds app names and the y axis holds ratings.
fig.update_xaxes(title="Names")
fig.update_yaxes(title="Ratings")
fig.update_layout(title= "Lowest rated apps", height = 600, width = 800, showlegend=False)
fig.show()

## <a id="4">Ratings Distribution</a>

In [None]:
data.Rating.hist()

**Observations**:
- The most common rating is 4.0
- 5.0 is the second most common rating
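
A histogram shows the shape, but the exact ranking of rating values is easier to read from a frequency table. This is a minimal sketch on a made-up `ratings` series (not the real `Rating` column); the same two calls should work on `data.Rating`.

```python
import pandas as pd

# Made-up ratings, for illustration only.
ratings = pd.Series([4.0, 4.0, 5.0, 3.5, 4.0, 5.0, 2.0], name="Rating")

# Frequency of each distinct rating, ordered by rating value.
counts = ratings.value_counts().sort_index()

# The rating value that occurs most often.
most_common = counts.idxmax()
print(counts)
print(most_common)
```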

## <a id="5">Apps With Most Ratings</a>

In [None]:
data.sort_values(['No of people Rated'],ascending=False).iloc[:10][['Name','Rating','No of people Rated']].style.background_gradient()

## <a id="6">Most Popular Category by No of Apps</a>

In [None]:
# Number of apps per category, computed once and reused.
apps_per_category = data.groupby('Category')['Rating'].count()
fig = px.bar(x=apps_per_category.index, y=apps_per_category.values, color=apps_per_category.values)
fig.update_layout(title='Most Popular Category by No of Apps')
fig.show()

**Observations**:
- Music is the largest category, with 753 apps
- Books and Business are the next largest categories by app count
- Government and Politics has the fewest apps

## <a id="7">Most Popular Category by Rating</a>

In [None]:
# Select the numeric column before averaging; a mean over the whole frame
# fails on the text columns in recent pandas versions.
mean_rating = data.groupby('Category')['Rating'].mean()
fig = px.bar(x=mean_rating.index, y=mean_rating.values, color=mean_rating.values)
fig.update_layout(title='Most Popular Category by Rating')
fig.show()

**Observations**:
- Government and Politics apps have the highest mean rating
- Kids and Family comes second, with a mean rating of 3.9
- Multimedia Design has the lowest mean rating, at 3.55

## <a id="8">Most Popular Category by Average No of Ratings</a>

In [None]:
# Select the numeric column before averaging; a mean over the whole frame
# fails on the text columns in recent pandas versions.
mean_num_ratings = data.groupby('Category')['No of people Rated'].mean()
fig = px.bar(x=mean_num_ratings.index, y=mean_num_ratings.values, color=mean_num_ratings.values)
fig.update_layout(title='Most Popular Category by Average No. of Ratings')
fig.show()

**Observations**:
- Multimedia Design has the highest average number of ratings, even though it has the lowest mean rating
- Social comes second, with 575 ratings per app on average
- Music apps have the lowest average number of ratings, at 539 per app
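
The three category views above (app count, mean rating, mean number of ratings) can also be read side by side from a single table. The sketch below uses a toy frame with invented values that reuses this notebook's column names; the output names `n_apps`, `mean_rating` and `mean_num_ratings` are my own labels.

```python
import pandas as pd

# Toy data with the notebook's column names; values are invented.
toy = pd.DataFrame({
    "Category": ["Music", "Music", "Social", "Books"],
    "Rating": [4.0, 3.0, 5.0, 4.5],
    "No of people Rated": [500, 600, 900, 100],
})

# One groupby with named aggregations yields all three summaries at once.
summary = (
    toy.groupby("Category")
       .agg(n_apps=("Rating", "size"),
            mean_rating=("Rating", "mean"),
            mean_num_ratings=("No of people Rated", "mean"))
       .sort_values("mean_num_ratings", ascending=False)
)
print(summary)
```

Sorting by a different column makes it easy to spot cases like a category with many ratings per app but a low mean rating.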

## <a id="15">Distribution of No of Ratings Across Ratings</a>

In [None]:
fig = px.box(data,x='Rating',y='No of people Rated')
fig.update_layout(title = 'Distribution of No of Ratings Across Ratings')
fig.show()

## <a id="9">Temporal Distributions</a>

In [None]:
data.Date = pd.to_datetime(data.Date)

In [None]:
# Number of apps posted on each date, computed once and reused.
uploads_per_date = data.groupby('Date')['Rating'].count()
fig = go.Figure(go.Scatter(
    x = uploads_per_date.index,
    y = uploads_per_date.values,
    ))
fig.update_layout(title='Temporal Distribution of App Uploads')
fig.show()

**Observations**:
- 2018 saw a spike in uploads
- The frequency of uploads has risen considerably since the initial period in 2012
- 2016-2018 saw high upload counts
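
Daily counts are noisy, so claims about whole years are easier to check after aggregating by year. A minimal sketch on made-up dates (not the real `Date` column):

```python
import pandas as pd

# Made-up upload dates, for illustration only.
dates = pd.to_datetime(pd.Series([
    "2012-05-01", "2016-03-10", "2016-11-20",
    "2018-01-15", "2018-06-30", "2018-12-01",
]))

# Number of uploads in each calendar year, in chronological order.
per_year = dates.dt.year.value_counts().sort_index()

# Year with the most uploads.
peak_year = per_year.idxmax()
print(per_year)
print(peak_year)
```

Applied to the real data, the equivalent would be `data.Date.dt.year.value_counts().sort_index()`, assuming `Date` has already been converted with `pd.to_datetime`.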

## <a id="10">Increase in Number of Apps over the years</a>

In [None]:
# Running total of apps posted up to each date.
cumulative_apps = data.groupby("Date").size().sort_index().cumsum()
fig = go.Figure(go.Scatter(x=cumulative_apps.index,
    y=cumulative_apps.values))
fig.update_layout(title='Rise of The Apps')
fig.show()

**Observations**:
- The curve from 2016 to 2018 is almost a straight line, indicating a steady rise in the number of apps
- The curve flattens in 2020, indicating a slowdown in app releases

## <a id="11">Free vs Paid apps</a>

In [None]:
# Label each app as Free or Paid in one step, avoiding chained assignment.
data['PriceCat'] = np.where(data['Price'] == 'Free', 'Free', 'Paid')
data.PriceCat.unique()

In [None]:
data.PriceCat.hist()

**Observations**:
- Most apps on the market are Free
- Free : 5163 
- Paid : 158
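
To put the two counts in proportion, here is the arithmetic as a short sketch, using only the numbers quoted in the observations above:

```python
# Counts quoted in the observations above.
free, paid = 5163, 158

total = free + paid
free_share = round(free / total * 100, 2)   # percentage of free apps
paid_share = round(paid / total * 100, 2)   # percentage of paid apps
print(total, free_share, paid_share)
```

So roughly 97% of the apps are free and about 3% are paid.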

## <a id="12">When did Paid Apps become a thing?</a>

In [None]:
fig = px.scatter(
    x = data.Date , 
    y =data.index,
    color=data.PriceCat
    )
fig.update_layout(title='When did Paid Apps become a thing?')
fig.show()

**Observations**:
- There weren't many paid apps initially
- Paid apps started picking up after 2016
- 2020 has seen the highest number of paid app releases
- Despite this, the release frequency of free apps has not been hindered
- That said, 2020 has seen fewer app releases overall than earlier years
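
The scatter plot gives a visual impression; a year-by-category table makes the same comparison countable. This is a sketch on a toy frame with invented rows, assuming a datetime `Date` column and a `PriceCat` column like the ones built in this notebook.

```python
import pandas as pd

# Toy rows; dates and labels are invented for illustration.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2014-02-01", "2017-05-05", "2017-08-09",
                            "2020-01-20", "2020-03-03"]),
    "PriceCat": ["Free", "Free", "Paid", "Paid", "Paid"],
})

# Rows are years, columns are Free/Paid, cells are app counts.
per_year = pd.crosstab(toy["Date"].dt.year, toy["PriceCat"])
print(per_year)
```

Reading down the `Paid` column shows in which year paid releases start to appear and when they peak.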

## <a id="13">Most Expensive Apps?</a>

In [None]:
# Convert Price to a number. Only the Price cell of free apps is set to "0";
# assigning to the whole boolean-masked frame would overwrite every column of those rows.
data.loc[data["Price"] == "Free", "Price"] = "0"
data["Price"] = data["Price"].str.replace("₹ ", "", regex=False)
data["Price"] = data["Price"].str.replace(",", "", regex=False)
data["Price"] = pd.to_numeric(data["Price"]).fillna(0.0)

In [None]:
# Ten most expensive apps, computed once and reused.
most_expensive = data.nlargest(10, columns="Price")
fig = go.Figure([go.Bar(y=most_expensive["Price"].values,
                     x=most_expensive["Name"],
                     text=most_expensive["Price"].values,)])
fig.update_layout(title='Most Expensive Apps')
fig.show()

**Observations**:
- The most expensive app is **Pengwin Enterprise** at ₹5,449

## <a id="14">Best Paid App?</a>

In [None]:
# Restrict to paid apps (Price > 0) with a 5.0 rating, then take the ten cheapest.
best_paid = data.query("Rating == 5 and Price > 0").nsmallest(10, columns="Price")
fig = go.Figure([go.Bar(y=best_paid["Price"],
                     x=best_paid["Name"],
                     text=best_paid["Price"])])
fig.update_layout(title='Best Rated and Inexpensive Apps')
fig.show()

**Observations**:
- The cheapest top-rated paid app is **Mobdus Monitor**, with a 5.0 rating, at ₹54
- Next is the Bible (King James Version), at ₹69
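
The price figures in the last two sections depend on turning price strings into numbers. The sketch below shows that cleaning step in isolation on a made-up series; the exact string format (a `₹ ` prefix, comma thousands separators and two decimals) is an assumption based on the replacements used in this notebook.

```python
import pandas as pd

# Made-up price strings in the assumed format, plus one missing value.
raw = pd.Series(["Free", "₹ 54.00", "₹ 5,449.00", None])

cleaned = (
    raw.str.replace("Free", "0", regex=False)   # free apps cost nothing
       .str.replace("₹", "", regex=False)       # drop the currency symbol
       .str.replace(",", "", regex=False)       # drop thousands separators
       .str.strip()
)

# Anything that still fails to parse becomes NaN, then 0.
prices = pd.to_numeric(cleaned, errors="coerce").fillna(0.0)
print(prices.tolist())
```

Using `errors="coerce"` hides unexpected formats, so it is worth checking how many values were coerced before trusting the result.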

If you find this notebook insightful, please upvote!