As part of my Data Analytics Internship at Oasis Infobyte, from 1 May 2024 to 5 June 2024, I completed three projects:
- Exploratory Data Analysis (EDA) on Retail Sales Data
- Data Cleaning on New York City Airbnb Listings
- Analyzing Google Play Store Data
For this project, part of my Data Analytics Internship at Oasis Infobyte, I used a public-domain dataset from Kaggle containing retail sales data collected from 10 different shopping malls in Istanbul, Turkey.
Link to the dataset: https://www.kaggle.com/datasets/mehmettahiraslan/customer-shopping-dataset
- invoice_no: invoice number
- customer_id
- gender: either male or female
- age
- category: such as clothing, shoes, cosmetics, et cetera
- quantity
- price: price per unit item
- payment_method: cash, debit card or credit card
- shopping_mall: name of the mall
- Data Loading and Cleaning
- Feature Engineering
- Descriptive Statistics
- Data Visualization
- Data Analysis
- Recommendations
This step involved:
- Loading the dataset into a pandas DataFrame
- Handling missing values
- Removing duplicates
- Correcting data types
- Formatting the dates
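The steps above can be sketched in pandas as follows. This is a minimal illustration rather than the exact notebook code: the inline sample rows stand in for the Kaggle CSV, the `invoice_date` column and its day/month/year format are assumptions, and dropping rows with missing values is only one possible way of handling them.

```python
import io

import pandas as pd

# Tiny inline sample standing in for the Kaggle CSV (values are made up).
raw = io.StringIO(
    "invoice_no,customer_id,gender,age,category,quantity,price,"
    "payment_method,invoice_date,shopping_mall\n"
    "I1,C1,Female,28,Clothing,5,1500.40,Credit Card,5/8/2022,Kanyon\n"
    "I1,C1,Female,28,Clothing,5,1500.40,Credit Card,5/8/2022,Kanyon\n"
    "I2,C2,Male,21,Shoes,3,1800.51,Debit Card,12/12/2021,Mall of Istanbul\n"
    "I3,C3,Male,,Clothing,1,300.08,Cash,9/11/2021,Kanyon\n"
)

# Load the dataset into a DataFrame.
df = pd.read_csv(raw)

# Handle missing values (here: drop incomplete rows) and remove duplicates.
df = df.dropna()
df = df.drop_duplicates()

# Correct data types: age was read as float because of the missing value.
df["age"] = df["age"].astype(int)

# Format the dates (day/month/year is an assumed format).
df["invoice_date"] = pd.to_datetime(df["invoice_date"], format="%d/%m/%Y")

print(df.dtypes)
```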
This step involved creating columns based on:
- categorization of malls into luxury, mega, mixed-use and outlet malls
- location of the mall: whether it is on the European side or the Asian side of Istanbul
- total price of items
- categorization of ages into the age groups baby, young adult, middle-aged adult and old adult
- total price of items purchased by each gender
- total price of items purchased by each age group
- total price of items purchased in each category of items
- total price of items purchased through each payment method
- total price of items purchased in each shopping mall
- total price of items purchased in each mall location
- total price of items purchased in each type of mall
- unit price of items purchased by each gender
- total quantity of items purchased by each gender
- total quantity of items purchased from each category
- total quantity of items purchased through each payment method
- total quantity of items purchased in each shopping mall
- total quantity of items purchased in each type of mall
- total quantity of items purchased in each mall location
- total quantity of items purchased by each age group
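A few of the derived columns and grouped totals listed above can be sketched like this. The sample rows are made up, the column names `total_price` and `age_group` are illustrative, and the age-bin edges are assumptions (only the 3 to 39 range for young adults is taken from this write-up).

```python
import pandas as pd

# Made-up sample rows using the dataset's column names.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female"],
    "age": [28, 45, 67],
    "quantity": [5, 3, 1],
    "price": [100.0, 200.0, 50.0],
    "shopping_mall": ["Kanyon", "Mall of Istanbul", "Kanyon"],
})

# Total price of items on each invoice.
df["total_price"] = df["quantity"] * df["price"]

# Age groups; the bin edges other than 3-39 are assumed for illustration.
bins = [0, 2, 39, 59, 120]
labels = ["baby", "young adult", "middle aged adult", "old adult"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Grouped totals, one per dimension of interest.
total_by_gender = df.groupby("gender")["total_price"].sum()
quantity_by_mall = df.groupby("shopping_mall")["quantity"].sum()
total_by_age_group = df.groupby("age_group", observed=True)["total_price"].sum()

print(total_by_gender)
print(quantity_by_mall)
print(total_by_age_group)
```

The same `groupby(...).sum()` pattern applies to category, payment method, mall type and mall location.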
This step involved deriving KPIs from the key columns using measures of central tendency. Here are a few important findings from this step:
- The average age of a person shopping in the malls is 43 years.
- On average, a customer buys about 3 items of any type from a shopping mall.
- On average, a person spends 689.26 Turkish liras per item in a shopping mall.
- Female customers made more purchases than male customers.
- Cash is the payment method for most invoices.
- The majority of the invoices come from malls located on the European side of Istanbul.
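Measures like these can be obtained with a few pandas calls. The sketch below uses made-up rows, so its numbers are not the project's results; it only shows which call produces which kind of KPI.

```python
import pandas as pd

# Made-up sample rows; the real values come from the Kaggle dataset.
df = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Female"],
    "age": [30, 40, 50, 60],
    "quantity": [2, 3, 4, 3],
    "payment_method": ["Cash", "Cash", "Credit Card", "Debit Card"],
})

mean_age = df["age"].mean()                  # average shopper age
mean_quantity = df["quantity"].mean()        # average items per invoice
top_payment = df["payment_method"].mode()[0] # most frequent payment method
invoices_by_gender = df["gender"].value_counts()

print(mean_age, mean_quantity, top_payment)
print(invoices_by_gender)
```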
In this step, I used several types of graphs, such as bar charts and pie charts, built with the matplotlib and seaborn libraries, to explore and understand the data.
A few of the most important graphs I made at this stage were:
- Comparison of Payment Methods
- Total Amount spent
- Purchase trends in each month
In this stage, I analyzed the graphs and descriptive statistics to draw the following inferences about the data:
- Customers aged 37 have the most invoices in shopping malls.
- On average, a customer buys about 3 items of any type from a shopping mall.
- On average, a person spends 689.26 Turkish liras per item in a shopping mall.
- The largest number of invoices comes from the Mall of Istanbul.
- More invoices have been issued to females than to males, so female customers account for more money spent and more items purchased.
- Both male and female customers visit luxury and mega malls, and malls in the European part of Istanbul, more often.
- Young adults (3 to 39 years) have more invoices than middle-aged and old adults, so their total price and quantity of items purchased are higher.
- Clothing is the most popular category across all invoices, followed by cosmetics and food/beverages.
- Clothing and shoes, followed by technology items, have the highest total price.
- Price per item is highest for the technology category, followed by shoes and clothing.
- Customers spend the most in October. The total price drops sharply in November and rises slightly in December. As a rough estimate, male customers shop the most in September, whereas female customers shop more in July and October.
- Most payments were made in cash, followed by credit card.
- Kanyon and Mall of Istanbul issued more invoices than the other malls.
- The invoices are dominated by purchases from mega malls.
- Most of the invoices are from malls located in the European part of Istanbul.
Here are my top recommendations based on the insights gathered:
- Malls should locate themselves in the European part of Istanbul and update their infrastructure to become Luxury or Mega Malls.
- Malls should focus more on their young adult (3 to 39 years) customers.
- Malls should also focus more on their female customers, as the number of invoices billed to females and the total price and quantity of items they purchase are higher than for males.
- Malls should make adequate arrangements during the month of October as it is the highest billing month.
- Malls should make sure that a customer can easily purchase items using cash.
- Malls should focus more on selling clothing items to customers.
For this project, part of my Data Analytics Internship at Oasis Infobyte, I used a public-domain dataset from Kaggle containing Airbnb listings and metrics in New York City for the year 2019 to perform extensive data cleaning.
Link to the dataset: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data/data
- id: Listing ID
- name: name/description of the Airbnb listing
- host_id
- host_name
- neighbourhood_group: location of the listing
- neighbourhood: area
- latitude
- longitude
- room_type: listing space type
- price: in US dollars
I have performed data cleaning in the following steps:
- Initial Data Exploration
- Imputation
- Date Parsing
- Outlier Detection and Handling
- Text preprocessing
- Label Encoding
- Column Removal
- Min Max Scaling
In the initial data exploration section, I imported all the necessary libraries and observed the following points:
- The name, neighbourhood_group, last_review and reviews_per_month columns have NaN values.
- There are no duplicate values in the dataset.
In the dataset, the name, neighbourhood_group, last_review and reviews_per_month columns had several NaN values. In this section, I imputed the NaN values in these columns using statistical functions such as mean and mode.
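A minimal sketch of this kind of imputation is shown below. Which column received the mean and which the mode is an assumption here (mode for text, mean for numeric values), and the sample rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up sample rows with missing values in a text and a numeric column.
df = pd.DataFrame({
    "name": ["Cozy loft", None, "Cozy loft"],
    "reviews_per_month": [1.0, np.nan, 3.0],
})

# Text column: fill with the most frequent value (mode).
df["name"] = df["name"].fillna(df["name"].mode()[0])

# Numeric column: fill with the column mean.
df["reviews_per_month"] = df["reviews_per_month"].fillna(
    df["reviews_per_month"].mean()
)

print(df)
```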
As a precaution, I converted the dates in the last_review column into a common format.
Here are a few points I noted in this section:
- Outliers can cause problems in model building, so I decided to remove them from the dataset.
- I used the z-score technique to detect outliers in the latitude, longitude and price columns, as they are the most crucial columns for the analysis.
- A data point whose z-score lies between -3 and +3 is not an outlier; otherwise it is classified as an outlier.
- I created the columns z_latitude and z_longitude to store the z-scores of the latitude and longitude columns respectively.
- I removed the rows where z_latitude > 3 or z_latitude < -3.
- There were no rows with z_longitude < -3.
- I removed the rows where z_longitude > 3.
- I created the column z_price to store the z-scores of the price column.
- Most of the price outliers have names containing words such as "luxury" and "trendy", which may justify their high price, so I kept the outliers in the price column as they are.
- There were no rows with z_price < -3.
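The z-score rule above can be sketched for one column as follows. The latitude values are synthetic with one planted outlier, and the z-score is computed directly with pandas (sample standard deviation), which may differ slightly from a library helper that uses the population standard deviation.

```python
import numpy as np
import pandas as pd

# Synthetic latitudes: 199 evenly spaced values plus one planted outlier.
latitudes = list(np.linspace(40.60, 40.90, 199)) + [45.0]
df = pd.DataFrame({"latitude": latitudes})

# z-score = (value - mean) / standard deviation.
df["z_latitude"] = (df["latitude"] - df["latitude"].mean()) / df["latitude"].std()

# Keep only rows whose z-score lies between -3 and +3.
cleaned = df[df["z_latitude"].abs() <= 3]

print(len(df), len(cleaned))
```

The same pattern applies to longitude and price by swapping the column name.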
In this section, I cleaned the text-based columns, such as name, host_name, neighbourhood_group and neighbourhood, in the following manner:
- I removed stopwords from the columns to concentrate on the more important words.
- I also included a punctuation removal function.
- I converted all these textual columns to lower case for easy handling.
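A simplified sketch of these text-cleaning steps is shown below. The tiny `STOPWORDS` set and the `clean_text` helper are hypothetical stand-ins; a fuller stopword list from an NLP library would normally be used instead.

```python
import string

import pandas as pd

# Tiny illustrative stopword list (an assumption, not a complete list).
STOPWORDS = {"a", "an", "the", "in", "of", "to", "and", "with"}


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOPWORDS)


# Made-up listing names.
df = pd.DataFrame({
    "name": ["Cozy Room in the Heart of Brooklyn!", "Loft, with a View"],
})
df["name"] = df["name"].apply(clean_text)

print(df["name"].tolist())
```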
In this section, I removed the id, host_name, z_latitude, z_longitude and z_price columns, as they don't play a major role in the remaining cleaning steps and contribute little to the visualization process.
In this section, I focused on converting many of the text-based columns to a numeric type using label encoding, to make them easier to handle.
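Label encoding maps each distinct text value to an integer. The sketch below uses pandas category codes as a dependency-free stand-in for a dedicated label encoder; the `room_type_encoded` column name and the sample values are illustrative.

```python
import pandas as pd

# Made-up sample values for a text-based column.
df = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room", "Shared room"],
})

# Each distinct label gets an integer code, assigned in sorted label order.
df["room_type_encoded"] = df["room_type"].astype("category").cat.codes

print(df)
```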
In the final section, I performed min-max scaling on the numeric columns to prevent any column from dominating a machine learning algorithm.
- I first divided the dataset into scaling and non-scaling features.
- Then I performed min-max scaling on the scaling features and stored the result in scaled features.
- I created new data frames from the scaled features and the non-scaling features, then concatenated them into the final cleaned dataset 'df'.
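The split, scale and concatenate flow can be sketched as below. The scaling is written out by hand as (x - min) / (max - min) rather than with a library scaler, and the choice of which columns count as scaling features is an assumption made for the example.

```python
import pandas as pd

# Made-up sample rows; room_type is assumed to be already label encoded.
data = pd.DataFrame({
    "room_type": [1, 0, 2],
    "price": [50.0, 150.0, 250.0],
    "latitude": [40.0, 40.5, 41.0],
})

# Divide the columns into scaling and non-scaling features.
scaling_cols = ["price", "latitude"]
scaling_features = data[scaling_cols]
non_scaling_features = data.drop(columns=scaling_cols)

# Min-max scaling: rescale each column to the [0, 1] range.
col_min = scaling_features.min()
col_max = scaling_features.max()
scaled_features = (scaling_features - col_min) / (col_max - col_min)

# Concatenate both parts into the final cleaned dataset.
df = pd.concat([non_scaling_features, scaled_features], axis=1)

print(df)
```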
In this project, done as part of my Data Analytics Internship at Oasis Infobyte, I performed an in-depth analysis of Google Play Store data. This analysis aims to understand the factors that contribute to an app's success, such as user ratings, number of installs, and category trends. By uncovering these insights, developers and businesses can optimize their app offerings.
Link to the datasets: https://www.kaggle.com/datasets/utshabkumarghosh/android-app-market-on-google-play
I used two datasets from Kaggle covering over 9000 apps from 20+ categories, along with the reviews of those apps. The datasets are:
- apps.csv: contains 14 columns pertaining to the apps listed on the Google Play Store
- user_reviews.csv: contains 5 columns pertaining to the reviews received by each app and the corresponding sentiment
The columns present in apps.csv are as follows:
- Unnamed: 0
- App
- Category
- Rating
- Reviews
- Size
- Installs
- Type
- Price
- Content Rating
- Genres
- Last Updated
- Current Ver
- Android Ver
The columns present in user_reviews.csv are as follows:
- App
- Translated_Review
- Sentiment
- Sentiment_Polarity
- Sentiment_Subjectivity
I have followed the steps given below to analyze this data:
- Initial Data Exploration
- Data Cleaning
- Feature Engineering
- Descriptive Statistics
- Data Visualization
- Data Analysis
I loaded the two datasets and stored them in pandas DataFrames. I also imported the necessary libraries.
In this section, I implemented the following steps:
- Handled missing values in critical columns like Rating and Reviews.
- Converted data types for columns like Installs and Price.
- Removed punctuation from text-based columns such as Category and Genres.
- Removed stopwords from text-based columns such as Translated_Review.
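The missing-value and type-conversion steps can be sketched as follows. The raw formats shown for Installs (such as "10,000+") and Price (such as "$4.99") are assumptions about how the values are stored, and filling Rating with the mean is just one possible strategy.

```python
import pandas as pd

# Made-up sample rows in an assumed raw format.
apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Rating": [4.1, None, 4.5],
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
})

# Handle missing ratings (here: fill with the mean rating).
apps["Rating"] = apps["Rating"].fillna(apps["Rating"].mean())

# Strip thousands separators and the trailing "+" before converting to int.
apps["Installs"] = (
    apps["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)
)

# Strip the currency symbol before converting to float.
apps["Price"] = apps["Price"].str.replace("$", "", regex=False).astype(float)

print(apps.dtypes)
```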
In this section, I created new columns in both dataframes, such as opinion type, average price of apps per category, average number of reviews per category and so on.
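Columns of this kind can be derived with grouped transforms, as in the sketch below. The new column names are illustrative, the sample rows are made up, and the definition of opinion type (a 0.5 threshold on Sentiment_Subjectivity) is an assumption, since the exact rule is not stated here.

```python
import pandas as pd

# Made-up rows for the apps dataframe.
apps = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "TOOLS"],
    "Price": [2.0, 4.0, 0.0],
    "Reviews": [10, 30, 100],
})

# Per-category averages, broadcast back onto every row of that category.
apps["avg_price_per_category"] = apps.groupby("Category")["Price"].transform("mean")
apps["avg_reviews_per_category"] = apps.groupby("Category")["Reviews"].transform("mean")

# Made-up rows for the reviews dataframe.
reviews = pd.DataFrame({"Sentiment_Subjectivity": [0.8, 0.2]})

# Opinion type: assumed rule, subjective above 0.5 and objective otherwise.
reviews["opinion_type"] = reviews["Sentiment_Subjectivity"].apply(
    lambda score: "subjective" if score > 0.5 else "objective"
)

print(apps)
print(reviews)
```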
This step involved deriving KPIs from the key columns using measures of central tendency. Here are a few important findings from this step:
- Most of the apps belong to the "free" category
- The average rating of apps is 4.17
- The average size of each app listed is 20.41 MB
- Most of the reviews are positive
In this step, I used several types of graphs, built with the matplotlib, seaborn and wordcloud libraries, to explore and understand the data.
A few of the most important graphs I made at this stage were:
- Distribution of ratings
- Count of Categories of apps
- Content Ratings of apps
Here are the top 15 insights I have gathered from the visualizations:
- The majority of the apps belong to the "family" category, followed by games and tools.
- The average rating of apps is 4.17.
- The average size of each app listed is 20.41 MB.
- Finance apps, followed by lifestyle apps, have the highest average prices among all categories.
- The art and design and education categories have the highest average ratings among all categories.
- Gaming apps, followed by family apps, have the highest average size per category.
- The communication category, followed by video players, has the highest average number of installations among all categories.
- The majority of the apps listed are free to use.
- Free apps are slightly larger on average than paid apps.
- The average number of installations for free apps is far greater than that for paid apps.
- Most apps have an "everyone" (suitable for all ages) content rating.
- Tools, education and entertainment are the most commonly used words to describe the genre of apps.
- Most of the reviews evoke a positive sentiment.
- Most of the reviews are subjective in nature.
- Good, game and great are some of the most commonly used words in the reviews of apps.