Hi everyone, this is part 2 of the series "Windows Store Exploratory Data Analysis (EDA)". If you haven't checked out part 1 yet, please use this link to access the article:

https://www.kaggle.com/sanchitagarwal/windows-store-eda-1

A quick look at the data revealed that the majority of the apps (97%) are free (price = 0), so I decided to stratify the dataset into two categories, "Free" and "Paid", based on the value of the PRICE column. The intuition is that the behaviour of users intending to purchase an application, and the resulting characteristics, will differ significantly from those of users who aren't. Nevertheless, later on I shall look for correlations between the two strata.

Let's start with the "Paid" category first.

<b>P.S: SQL-related code has been commented out and replaced with Kaggle-appropriate code.</b>

In [None]:
#import pymssql
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import dates
%matplotlib inline
import seaborn as sns

In [None]:
"""server = "You seriously thought"
user = "that I'm stupid enough to"
password = "expose my credentials to the public ? "

connection = pymssql.connect(server, user, password, "master")

with connection:
    with connection.cursor(as_dict = True) as cursor:
        sql = "SELECT * FROM windows_store WHERE price != 0.0"
        cursor.execute(sql)
        result = cursor.fetchall()
        #print(result)
        paid_apps_df = pd.DataFrame(result)"""

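apps_df = pd.read_csv("../input/windows-store/msft.csv")
# Missing prices are treated as free apps.
apps_df["Price"] = apps_df["Price"].fillna("Free")

# .copy() gives an independent frame, so the column assignments below do not touch apps_df.
paid_apps_df = apps_df.loc[apps_df.Price != "Free"].copy()
paid_apps_df['Date'] = pd.to_datetime(paid_apps_df['Date'])

# Strip the rupee sign and thousands separators before casting to float.
paid_apps_df['Price'] = paid_apps_df['Price'].replace(r"[₹,]", "", regex=True).astype(float)
paid_apps_df.head()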

<b>P.S</b>: It's important to convert the DATE column to the pandas datetime dtype to avoid any conflicts when doing datetime analysis with pandas.
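Here is a minimal sketch of why the conversion matters. The three-row frame is invented for illustration (it is not taken from msft.csv), and the names `toy`, `has_dt_before` and `apps_per_year` are mine. A column read from CSV holds plain strings, so the `.dt` accessor is unavailable until `pd.to_datetime` has been applied.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Date": ["2019-01-15", "2019-07-01", "2020-03-10"],
    "Rating": [4.0, 3.5, 2.0],
})

# While the column holds strings, datetime accessors are not available.
try:
    toy["Date"].dt.year
    has_dt_before = True
except AttributeError:
    has_dt_before = False

# After conversion, date parts can be used directly for grouping.
toy["Date"] = pd.to_datetime(toy["Date"])
apps_per_year = toy.groupby(toy["Date"].dt.year)["Rating"].count()

print(has_dt_before)
print(apps_per_year)
```

The same idea is what makes the `pd.Grouper(key="Date", ...)` calls later in this notebook work.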

In [None]:
paid_apps_df.describe()

In the RATING column we can see that the average rating for paid apps is a mere 2.47, a horrendous result considering that one would expect a better user experience from apps that have been paid for.

In the PRICE column, the range is pretty large (5449 - 54 = 5395), and so is the standard deviation, meaning the prices are widely spread. Interestingly, the 50th and 75th percentiles both equal 269, which means roughly a quarter or more of the paid apps are priced at 269 and suggests that it is the most common price. This can be verified by finding the mode of the PRICE column.
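To make the percentile reasoning concrete, here is a small sketch on invented prices. Only the minimum, maximum and the value 269 echo the figures quoted above; everything else is made up. When the median and the 75th percentile coincide, a sizeable block of the sorted values sits at that price, which hints at the mode without proving it.

```python
import pandas as pd

# Invented prices for illustration; not the real PRICE column.
prices = pd.Series([54, 99, 134, 269, 269, 269, 269, 549, 5449], dtype=float)

q50 = prices.quantile(0.50)
q75 = prices.quantile(0.75)
share_at_median = (prices == q50).mean()   # fraction of apps priced at the median
most_common = prices.mode().iloc[0]        # mode() returns a Series, take the first value
price_range = prices.max() - prices.min()

print(q50, q75, most_common, price_range, round(share_at_median, 2))
```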


In [None]:
paid_apps_df["Price"].mode()

Voila!!

In [None]:
paid_apps_df.Category.describe()

In [None]:
paid_apps_df.Category.unique()

"Book" being the top category among paid apps suggests that books are the product people are most willing to pay for, albeit with only about a third of the categorised records (56/158 ≈ 35.44%). Also note that there are 8 records with a NULL entry in the CATEGORY column.

Now let's do some plotting!
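The share and the NULL count can also be computed directly instead of being read off `describe()`. This is a sketch on an invented category column (the labels and counts are made up); only the last line reuses the 56 and 158 quoted above.

```python
import pandas as pd

# Invented categories for illustration; None marks a missing category.
categories = pd.Series(
    ["Books", "Books", "Music", "Books", None, "Business", "Music", None]
)

counts = categories.value_counts()           # missing values are excluded by default
top_category = counts.index[0]
top_share = counts.iloc[0] / counts.sum()    # share among non-null records
missing = categories.isna().sum()

# The share quoted in the text, recomputed from the two numbers given there.
quoted_share = round(56 / 158 * 100, 2)

print(top_category, top_share, missing, quoted_share)
```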

In [None]:
df_plot_count = paid_apps_df[["Rating", "Date"]].groupby(pd.Grouper(key="Date", freq='Y')).count()
#print(df_plot_count)
plt.figure(figsize=(15,6))
plt.title("Paid Apps launched per year")

plot = sns.lineplot(data=df_plot_count)
plot.xaxis.set_major_formatter(dates.DateFormatter("%Y"))
plot.set(ylabel = "Count")
plot.legend(labels=["Count"])

plt.show()

This graph shows that the number of paid apps launched per year is increasing almost exponentially, which points to the growing popularity of the Windows app ecosystem.

I am interested in knowing whether there is any seasonal trend in PRICE, RATING and NO_OF_PEOPLE.
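One rough way to check the "almost exponential" impression, sketched here on invented yearly counts rather than the real ones: if growth is exponential, the logarithm of the count is linear in the year, so a straight-line fit on the log scale gives the yearly growth factor. The arrays below are assumptions for illustration.

```python
import numpy as np

# Invented yearly counts that double every year; not the real data.
years = np.array([2011, 2012, 2013, 2014, 2015])
counts = np.array([2, 4, 8, 16, 32])

# Fit log(count) = slope * (year - first_year) + intercept.
slope, intercept = np.polyfit(years - years[0], np.log(counts), 1)
growth_factor = np.exp(slope)   # multiplicative growth per year

print(round(growth_factor, 3))
```

On real counts the fit will not be exact, and years with zero launches would need to be dropped before taking the logarithm.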

In [None]:
# Select the columns of interest and index them by date in one step.
df_plot = paid_apps_df[["Date", "Rating", 'No of people Rated', "Price"]].set_index('Date')

df_plot = df_plot.groupby(pd.Grouper(freq='M')).mean().dropna(how="all")

plt.figure(figsize=(15,6))

plt.title("PRICE Seasonality analysis")
plot = sns.lineplot(data= df_plot["Price"])
plot.set(ylabel = "Price")
plot.xaxis.set_minor_locator(dates.MonthLocator())
plot.xaxis.set_major_formatter(dates.DateFormatter("%Y"))

plt.show()

From 2011 to 2013 the increase was almost linear, but afterwards there are sudden surges followed by decreases, like a sine wave.

Also note that for the years 2013, 2015, 2017, 2018 and 2020, surges in price are detected in the first months (Jan, Feb, March). What could cause this pattern to re-emerge? Something to ponder.

Lastly, from 2020 onwards, prices are in low territory compared to previous years. I believe this can be attributed to a change in pricing strategy (where apps are free to download with additional content available for purchase) and/or an increase in piracy.
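The first-quarter surges noted above could be checked numerically by averaging the monthly series per calendar month. This is a sketch under assumptions: the monthly values are invented, and `monthly` stands in for a series like `df_plot["Price"]` with a datetime index.

```python
import pandas as pd

# Invented monthly mean prices for illustration; not the real series.
monthly = pd.Series(
    [400.0, 200.0, 200.0, 500.0, 250.0, 150.0],
    index=pd.to_datetime([
        "2017-01-31", "2017-06-30", "2017-10-31",
        "2018-02-28", "2018-06-30", "2018-11-30",
    ]),
    name="Price",
)

# Average over all years for each calendar month (1 = Jan, ..., 12 = Dec).
by_month = monthly.groupby(monthly.index.month).mean()

first_quarter = by_month[by_month.index <= 3].mean()
rest_of_year = by_month[by_month.index > 3].mean()

print(by_month)
print(first_quarter, rest_of_year)
```

If the first-quarter average stays clearly above the rest of the year on the real data, that would back up the pattern seen in the plot.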

In [None]:
plt.figure(figsize=(15,6))

plt.title("RATING seasonality analysis")
plot = sns.lineplot(data= df_plot[["Rating"]])
plot.xaxis.set_minor_locator(dates.MonthLocator())
plot.set(ylabel = "Rating")
plot.xaxis.set_major_formatter(dates.DateFormatter("%Y"))

plt.show()

We can see that volatility increases as we move further along the x-axis. This could be attributed to the increase in the number of apps published over the years, as seen in the first graph.

In [None]:
plt.figure(figsize=(15,6))

plt.title("No of people seasonality analysis")
plot = sns.lineplot(data= df_plot[["No of people Rated"]])
plot.xaxis.set_minor_locator(dates.MonthLocator())
plot.xaxis.set_major_formatter(dates.DateFormatter("%Y"))

plt.show()

This graph and the previous one are very similar, which makes sense since the set of people who rated an app is a subset of the set of people who downloaded it, i.e. not everyone who downloaded the app may have rated it, but those who rated it have definitely downloaded it!

Nothing new to explore here.

Now let's look for any correlation between PRICE and RATING.

In [None]:
plt.figure(figsize=(15,6))

plt.title("Correlation between Price & Rating")

plot = sns.scatterplot(y=paid_apps_df["Price"], x=paid_apps_df["Rating"])

plt.show()

For the 4 rating, the range of prices is the largest, while the perfect 5 rating has been achieved by relatively inexpensive apps. This suggests that the price tag of an application is not by itself an indication of the user experience. This is further supported by the 3 rating achieved by the humongously priced "5449" app.

That's it for now. In the next article I will continue with my analysis of the "Free apps" stratum of the dataset. Please don't hesitate to share your feedback. Thanks!
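The scatter plot can be backed up with numbers. The sketch below uses an invented six-row frame (only the 5449 price echoes the text) to compute the Pearson correlation between price and rating and the price range per rating value. The names `toy`, `pearson` and `spread` are mine.

```python
import pandas as pd

# Invented rows for illustration; not the real paid-apps data.
toy = pd.DataFrame({
    "Rating": [3.0, 4.0, 4.0, 4.0, 5.0, 5.0],
    "Price": [5449.0, 54.0, 1099.0, 2699.0, 99.0, 134.0],
})

# Pearson correlation (the pandas default) between the two columns.
pearson = toy["Price"].corr(toy["Rating"])

# Price spread for each rating value.
spread = toy.groupby("Rating")["Price"].agg(["min", "max"])
spread["range"] = spread["max"] - spread["min"]

print(round(pearson, 3))
print(spread)
```

On the real data the same two lines would be applied to `paid_apps_df`. A coefficient close to zero or negative would agree with the reading that a higher price does not come with a higher rating.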