# Windows Store 

<img src="https://www.slashgear.com/wp-content/uploads/2015/02/Windows_logo_Cyan_rgb_D-800x420.png" width="500" height="600">

<blockquote> Exploratory Data Analysis on this dataset</blockquote>

## Upvote My Kernel If you like it

 - Contains 5322 rows and 6 columns:
<ul>
<li>Name: Name of the app.</li>
<li>Rating: Rating of the app.</li>
<li>No of people Rated: Number of people who rated the app.</li>
<li>Category: Category of the app.</li>
<li>Date: Date when the app was posted.</li>
<li>Price: Price of the app.</li>
</ul>
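
A quick sanity check of this schema could look like the sketch below. The `check_schema` helper and the two sample rows are hypothetical and only illustrate the idea; the column names follow the ones used in the notebook's code.

```python
import pandas as pd

# Column names as they are used later in the notebook's code.
EXPECTED_COLUMNS = ["Name", "Rating", "No of people Rated", "Category", "Date", "Price"]

def check_schema(df: pd.DataFrame) -> list:
    """Return the expected columns that are missing from df (hypothetical helper)."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Made-up rows for illustration only; they are not taken from the real dataset.
sample = pd.DataFrame({
    "Name": ["App A", "App B"],
    "Rating": [4.0, 5.0],
    "No of people Rated": [120, 35],
    "Category": ["Music", "Books"],
    "Date": ["01-02-2015", "15-06-2016"],
    "Price": ["Free", "₹ 54.00"],
})

missing = check_schema(sample)
print(sample.shape, missing)
```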

## Importing Modules

In [None]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")
%matplotlib inline

### Importing Dataset as ws

In [None]:
ws=pd.read_csv("../input/windows-store/msft.csv")

In [None]:
ws.head(5)

In [None]:
ws.axes

In [None]:
ws.nunique()

> The output above shows that Rating and Category have only a few unique values, so we can use them to group the apps.
<br>

# Missing values

In [None]:
ws.isnull().mean().sort_values(ascending=False)

> The share of missing values is very small, so instead of filling them in with the mean or median we can simply delete those rows. This should not noticeably affect the analysis.

In [None]:
#for removing null rows
ws=ws.dropna()

ws.isnull().mean().sort_values(ascending=False)

> The rows with missing values have been deleted.
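
The drop-versus-impute decision above can be sketched as follows. The 5% cutoff and the `drop_sparse_missing` name are assumptions made for illustration; the notebook itself simply calls `dropna()`.

```python
import numpy as np
import pandas as pd

def drop_sparse_missing(df: pd.DataFrame, max_share: float = 0.05) -> pd.DataFrame:
    """Drop rows with missing values only if every column is missing less than max_share.

    Hypothetical helper; max_share is an assumed cutoff, not a value from the notebook.
    """
    worst = df.isnull().mean().max()  # largest per-column share of missing values
    if worst >= max_share:
        raise ValueError(f"{worst:.1%} missing in one column; consider imputing instead")
    return df.dropna()

# Made-up data: one missing rating out of 25 rows (4% missing).
demo = pd.DataFrame({"Rating": [4.0] * 24 + [np.nan], "Price": ["Free"] * 25})
cleaned = drop_sparse_missing(demo)
print(len(demo), "->", len(cleaned))
```

With a larger share of missing values the helper raises instead of silently discarding data, which is the point at which imputation would be worth considering.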

# Exploratory Data Analysis


In [None]:
sns.set(font_scale=1.4)
#for bar plot
ws["Rating"].value_counts().plot(kind='bar', figsize=(7, 5), rot=0)
plt.xlabel("Ratings", labelpad=14)
plt.ylabel("Number of apps", labelpad=14)
plt.title("Number of Apps by Rating", y=1.02)

> The bar plot above shows the number of apps for each rating. A rating of 4.0 was given to more than 1200 apps, while a rating of 5.0 was given to fewer than 1000. The spread across ratings also suggests that the dataset is reasonably balanced.
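
The same distribution can also be read as a table. The sketch below uses a handful of made-up ratings; sorting by index puts the ratings in numeric order rather than in order of frequency, and `normalize=True` gives shares instead of counts.

```python
import pandas as pd

# Made-up ratings for illustration only.
ratings = pd.Series([4.0, 4.0, 5.0, 3.5, 4.0, 5.0], name="Rating")

# Number of apps per rating, ordered by rating value.
counts = ratings.value_counts().sort_index()

# Share of apps per rating (values sum to 1).
shares = ratings.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(2))
```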

In [None]:
sns.set(font_scale=1.4)
#for barh plot
ws["Category"].value_counts().plot(kind='barh', figsize=(15, 5), rot=0)
plt.xlabel("Number of apps", labelpad=14)
plt.ylabel("Category of the App", labelpad=14)
plt.title("Number of Apps by Category", y=1.02);

> The horizontal bar plot above shows the number of apps in each category.

In [None]:
rt=ws.loc[:,["Name","Rating","No of people Rated"]]

#sorting the values by rating 
rt=rt.sort_values(by="Rating",ascending=False)

#taking only 5 ratings
rt_people=rt[rt["Rating"]==5.0]

#sorting the ratings by people count
rt_top20=rt_people.sort_values(by="No of people Rated",ascending=False)

rt_top20.head(20).plot.barh(x='Name', y='No of people Rated',figsize=(9, 10), rot=0)
plt.xlabel("No of people Rated")
plt.ylabel("Name of the App", labelpad=14)
plt.title("\n\nTop 20 Apps with a High Number of Users and Ratings", y=1.02);

> We have selected the top 20 apps based on two parameters:

* No of people Rated
* Rating

> We could also pick the top 20 apps by Rating alone, but a rating is the average of many users' ratings. An app with only a few users can have a high average simply because those few users rated it positively, so that value alone is not reliable. We therefore look for apps that have both a high rating and a high number of people who rated them.
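
The reasoning above can be illustrated on a small made-up table. The app names and numbers are invented, and `top_apps` is a hypothetical helper; it mirrors the notebook's approach of keeping the apps rated 5.0 and ranking them by how many people rated them.

```python
import pandas as pd

def top_apps(df: pd.DataFrame, n: int = 20) -> pd.DataFrame:
    """Keep apps rated 5.0 and rank them by number of raters (hypothetical helper)."""
    five_star = df[df["Rating"] == 5.0]
    return five_star.sort_values(by="No of people Rated", ascending=False).head(n)

# Invented apps for illustration only.
apps = pd.DataFrame({
    "Name": ["Tiny", "Popular", "Good", "Niche"],
    "Rating": [5.0, 5.0, 4.5, 5.0],
    "No of people Rated": [3, 900, 2000, 40],
})

ranked = top_apps(apps, n=2)
print(ranked["Name"].tolist())
```

Here "Tiny" has a perfect rating but only 3 raters, so it drops out of the top 2 once the number of raters is taken into account.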

In [None]:
#preprocessing the price column
ws["Price"]=ws["Price"].str.replace("Free","0")
ws["Price"]=ws["Price"].str.replace("₹ ","")
ws["Price"]=ws["Price"].str.replace(",","")
ws["Price"]=ws["Price"].astype("float")

# Count free (price 0) and paid apps
free = int((ws["Price"] == 0).sum())
cost = len(ws) - free
top = [('Free', free), ('Paid', cost)]

labels, ys = zip(*top)
xs = np.arange(len(labels))
width = 1

plt.bar(xs, ys, width, align='center', color=("blue", "orange"))
plt.title("Count of free and paid apps")
plt.xticks(xs, labels)
plt.yticks(ys)


> The vast majority of apps are free.
<br>
<br>
`(158/5322)*100
  = 2.9688087185268697
`
<br><br>
> Only about 2.97% of the apps are paid.
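
The price clean-up and the percentage above can be reproduced in plain Python. This is a minimal sketch: `parse_price` and `paid_share` are hypothetical helpers, the sample price strings are made up, and the counts 158 and 5322 are the ones quoted above.

```python
def parse_price(text: str) -> float:
    """Convert a price string such as 'Free' or '₹ 1,234.00' to a float."""
    if text.strip() == "Free":
        return 0.0
    # Remove the currency symbol and thousands separators before converting.
    return float(text.replace("₹", "").replace(",", "").strip())

def paid_share(n_paid: int, n_total: int) -> float:
    """Percentage of paid apps among all apps."""
    return n_paid / n_total * 100

# Made-up price strings in the format the notebook's preprocessing expects.
prices = [parse_price(p) for p in ["Free", "₹ 54.00", "₹ 1,624.00"]]
share = paid_share(158, 5322)
print(prices, round(share, 2))
```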

In [None]:
# Converting Date column values to Date format
ws['Date']= pd.to_datetime(ws['Date'], format="%d-%m-%Y")

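# Sorting the values based on Date
dte = ws.sort_values(by='Date', ascending=True)

# Setting index as the date
dte.index = dte.Date

# Resampling by year; select the numeric column explicitly, because
# averaging text columns such as Name can raise an error in recent pandas versions
yr = dte.resample('Y')[["No of people Rated"]].mean()

# Setting figure size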
sns.set(rc={'figure.figsize':(10,5)})
sns.barplot(x=yr.index.year, y="No of people Rated", data=yr)
plt.xlabel("Year", labelpad=17)
plt.ylabel("Number of users", labelpad=14)

plt.title("Average number of people who rated, by year", y=1.01);

> This plot shows the average number of people who rated an app, grouped by year. For this dataset, the average number of users giving ratings decreases over the years.
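
The yearly average can be checked on a few made-up rows. The sketch below groups by calendar year with `groupby` instead of `resample`; for a plain per-year mean the two should give the same numbers, but this is an alternative formulation and not the notebook's own code.

```python
import pandas as pd

# Made-up rows; dates use the same day-month-year format as the notebook.
demo = pd.DataFrame({
    "Date": ["01-02-2015", "15-06-2015", "03-03-2016", "20-11-2016"],
    "No of people Rated": [800, 600, 300, 100],
})
demo["Date"] = pd.to_datetime(demo["Date"], format="%d-%m-%Y")

# Group by calendar year and average the number of people who rated.
yearly = demo.groupby(demo["Date"].dt.year)["No of people Rated"].mean()
print(yearly)
```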

In [None]:
sns.set(rc={'figure.figsize':(18,5)})
sns.boxplot(x="Rating",y="No of people Rated",data=ws,hue="Category")

# adding legend outside of the plot
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)


> The box plot shows that ratings are not tied to particular categories. There is no case where, for example, only apps in the Music category receive a rating of 5; every rating appears across all sorts of categories.
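
One way to check this claim numerically is a cross-tabulation of category against rating. The rows below are made up, and the check is only a sketch of the idea, not part of the original analysis.

```python
import pandas as pd

# Made-up rows for illustration only.
demo = pd.DataFrame({
    "Category": ["Music", "Music", "Books", "Books", "Books"],
    "Rating": [5.0, 4.0, 5.0, 4.0, 4.0],
})

# Rows: categories, columns: rating values, cells: number of apps.
table = pd.crosstab(demo["Category"], demo["Rating"])

# True for each category that received every rating value at least once.
covers_all = (table > 0).all(axis=1)

print(table)
print(covers_all)
```

If some rating were given only within a single category, that rating's column would contain zeros for every other category and `covers_all` would be `False` for them.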

In [None]:
sns.boxplot(x="Rating",y="Price",data=ws)
plt.title("Rating vs Price")