Oasis Infobyte Data Analytics Internship

As part of my Data Analytics Internship at Oasis Infobyte, which ran from 1st May 2024 to 5th June 2024, I completed 3 projects:

Exploratory Data Analysis (EDA) on Retail Sales Data
Data Cleaning on New York City Airbnb Listings
Analyzing Google Play Store Data

EDA on Retail Sales Data

As part of my Data Analytics Internship at Oasis Infobyte, I used a public-domain dataset from Kaggle containing retail sales data collected from 10 different shopping malls in Istanbul, Turkey.

Columns in the dataset

Link to the dataset: https://www.kaggle.com/datasets/mehmettahiraslan/customer-shopping-dataset

invoice_no: invoice number
customer_id
gender: either male or female
age
category: such as clothing, shoes, cosmetics, et cetera
quantity
price: price per unit item
payment_method: cash, debit card or credit card
shopping_mall: name of the mall

Project Structure

Data Loading and Cleaning
Feature Engineering
Descriptive Statistics
Data Visualization
Data Analysis
Recommendations

Data Loading and Cleaning

This step involved:

Loading the dataset into a Pandas dataframe
handling missing values
removing duplicates
correcting data types
formatting the dates
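
The steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the inline sample rows are made up, the invoice_date column and its day/month/year format are assumptions, and dropping rows with missing values is just one possible way of handling them.

```python
import io

import pandas as pd

# Tiny inline sample standing in for customer_shopping_data.csv (values are made up).
csv_text = """invoice_no,customer_id,gender,age,category,quantity,price,payment_method,invoice_date,shopping_mall
I1,C1,Female,28,Clothing,5,1500.40,Credit Card,5/8/2022,Kanyon
I1,C1,Female,28,Clothing,5,1500.40,Credit Card,5/8/2022,Kanyon
I2,C2,Male,,Shoes,3,1800.51,Debit Card,12/12/2021,Forum Istanbul
I3,C3,Male,20,Clothing,1,300.08,Cash,9/11/2021,Metrocity
"""

# Load the dataset into a pandas dataframe.
df = pd.read_csv(io.StringIO(csv_text))

# Handle missing values: here, rows with any missing value are simply dropped.
df = df.dropna()

# Remove exact duplicate rows.
df = df.drop_duplicates().reset_index(drop=True)

# Correct data types: age was read as float because of the missing value.
df["age"] = df["age"].astype(int)

# Format the dates: parse day/month/year strings into datetime values.
df["invoice_date"] = pd.to_datetime(df["invoice_date"], format="%d/%m/%Y")
```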

Feature Engineering

This step involved creating columns based on:

categorization of malls such as Luxury mall, mega mall, mixed-use mall and outlet mall
location of mall - whether the mall is located on the European side of Istanbul or on the Asian side of Istanbul.
total price of items
categorization of ages into age groups: baby, young adult, middle aged adult and old adult
total price of items purchased by each gender
total price of items purchased by each age group
total price of items purchased in each category of items
total price of items purchased through each payment method
total price of items purchased in each shopping mall
total price of items purchased in each mall location
total price of items purchased in each type of mall
unit price of items purchased by each gender
total quantity of items purchased by each gender
total quantity of items purchased from each category
total quantity of items purchased through each payment method
total quantity of items purchased in each shopping mall
total quantity of items purchased in each type of mall
total quantity of items purchased in each mall location
total quantity of items purchased by each age group
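
A few of these derived columns could be built along the following lines. This is a hedged sketch on made-up rows: the new column names are hypothetical, and so are the age bin edges, apart from the 3 to 39 range used for young adults elsewhere in this README.

```python
import pandas as pd

# Made-up rows using column names from the dataset.
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [25, 45, 67, 38],
    "quantity": [2, 1, 3, 4],
    "price": [100.0, 250.0, 50.0, 20.0],
})

# Total price of each invoice: unit price times quantity.
df["total_price"] = df["quantity"] * df["price"]

# Age groups; each bin includes its right edge, e.g. (2, 39] is "young adult".
# The edges 2, 59 and 120 are assumptions made for illustration.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 2, 39, 59, 120],
    labels=["baby", "young adult", "middle aged adult", "old adult"],
)

# Per-gender totals, broadcast back onto every row of that gender.
df["total_price_by_gender"] = df.groupby("gender")["total_price"].transform("sum")
df["total_quantity_by_gender"] = df.groupby("gender")["quantity"].transform("sum")
```

The same groupby and transform pattern extends to the other groupings listed above (category, payment method, shopping mall and so on) by changing the grouping column.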

Descriptive Statistics

This step involved deriving KPIs from the crucial columns using measures of central tendency. Here are a few important bits of information derived from this step:

The average age of a person shopping in the malls is 43 years
On an average, a customer buys about 3 items of any type from the shopping mall
On an average, a person spends 689.26 Turkish Liras per item in a shopping mall
The number of purchases made by females is higher than that made by males.
The payment method for most invoices is cash.
The majority of the invoices come from malls located on the European side of Istanbul.
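
Figures like these can be obtained with pandas' built-in aggregations. The sketch below runs on made-up rows, so its numbers are illustrative only and do not reproduce the results above; the variable names are hypothetical.

```python
import pandas as pd

# Made-up rows; real values come from the cleaned dataset.
df = pd.DataFrame({
    "gender": ["Female", "Female", "Male"],
    "age": [30, 40, 50],
    "quantity": [2, 3, 4],
    "price": [100.0, 200.0, 300.0],
    "payment_method": ["Cash", "Cash", "Credit Card"],
})

# Measures of central tendency for the numeric columns.
average_age = df["age"].mean()
average_quantity = df["quantity"].mean()
average_price = df["price"].mean()

# Most frequent value (mode) of a categorical column.
most_common_payment = df["payment_method"].mode()[0]

# Number of invoices per gender, most frequent first.
invoices_per_gender = df["gender"].value_counts()
```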

Data Visualization

In this step, I used several types of graphs, such as bar charts and pie charts, built with the matplotlib and seaborn libraries to explore and understand the data.

A few of the most pivotal graphs I made at this stage were:

Comparison of Payment Methods
Total Amount spent
Purchase trends in each month
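
As an illustration of the last graph, a monthly bar chart might be produced as sketched below with matplotlib. The rows are made up, and the invoice_date and total_price column names are assumptions; the actual notebook may have plotted this differently.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up invoices with an already parsed date and a total price.
df = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2022-09-03", "2022-10-10", "2022-10-21", "2022-11-05"]),
    "total_price": [120.0, 300.0, 180.0, 90.0],
})

# Sum the total price per calendar month.
monthly_total = df.groupby(df["invoice_date"].dt.month)["total_price"].sum()

# Draw one bar per month.
fig, ax = plt.subplots()
ax.bar([str(month) for month in monthly_total.index], monthly_total.values)
ax.set_xlabel("Month")
ax.set_ylabel("Total price (Turkish Liras)")
ax.set_title("Purchase trends in each month")
plt.close(fig)  # in a notebook, plt.show() would display the figure instead
```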

Data Analysis

In this stage, I analyzed the graphs and descriptive statistics to make the following inferences about the data:

Customers of age 37 years have the most invoices in shopping malls
On an average, a customer buys about 3 items of any type from the shopping mall
On an average, a person spends 689.26 Turkish Liras per item in a shopping mall
The maximum number of invoices are from the Mall of Istanbul
More invoices have been issued to females than to males, so female customers spend more money and purchase a larger quantity of items.
Both male and female customers visit Luxury and Mega Malls, and malls in the European part of Istanbul, more often.
Young adults (3 years to 39 years) have more invoices than middle aged and old adults, hence the total price and quantity of items they purchase are higher.
Clothing emerges as the most popular category among all the invoices, followed by cosmetics and food/beverages.
Clothing and shoes, followed by technology items, have the highest total price.
Price per item is the highest for the technology category, followed by shoes and clothing.
Customers spend more in the month of October. There is a sharp decrease in the total price in November followed by a slight increase in December. A rough estimation can be made that males shop the most in September, whereas female customers shop more in July and October.
Most of the payments have been made in cash, followed by credit card.
More invoices have been issued by Kanyon and Mall of Istanbul than by other malls.
The invoices are dominated by purchases from mega malls.
Most of the invoices are from malls located in the European part of Istanbul.

Recommendations

Here are my top recommendations based on the insights gathered:

Malls should locate themselves in the European part of Istanbul and update their infrastructure to become Luxury or Mega Malls.
Malls should focus more on their young adult (3 years to 39 years) customers.
Malls should also focus more on their female customers, as the number of invoices billed to females and the total price and quantity of items purchased by females are higher than those of males.
Malls should make adequate arrangements during the month of October as it is the highest billing month.
Malls should make sure that a customer can easily purchase items using cash.
Malls should focus more on selling clothing items to customers.

LinkedIn : https://www.linkedin.com/posts/amruha-ahmed_oasisinfobyte-intern-internship-activity-7194752879071571969-efTM?utm_source=share&utm_medium=member_desktop

Data Cleaning on New York City Airbnb Listings

As part of my Data Analytics Internship at Oasis Infobyte, I used a public-domain dataset from Kaggle containing Airbnb listings and metrics in New York City for the year 2019 to perform extensive data cleaning.

Columns in the dataset

Link to the dataset: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data/data

id: Listing ID
name: name/description of the airbnb listing
host_id
host_name
neighbourhood_group: location of the listing
neighbourhood: area
latitude
longitude
room_type: listing space type
price: in US dollars

Project Structure

I have performed data cleaning in the following steps:

Initial Data Exploration
Imputation
Date Parsing
Outlier Detection and Handling
Text preprocessing
Column Removal
Label Encoding
Min Max Scaling

Initial Data Exploration

In this section, I imported all the necessary libraries and observed the following points:

The name, neighbourhood_group, last_review and reviews_per_month columns have NaN values.
There are no duplicate values in the dataset.

Imputation

Before this section, the name, neighbourhood_group, last_review and reviews_per_month columns had several NaN values. I imputed the NaN values in these columns using statistical functions such as mean and mode.
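
Mean and mode imputation can be sketched as follows. The rows are made up, and which statistic was applied to which column is an assumption made here for illustration: mode for a text column, mean for a numeric one.

```python
import pandas as pd

# Made-up rows with one missing value per column.
df = pd.DataFrame({
    "name": ["Cozy loft", None, "Cozy loft"],
    "reviews_per_month": [1.0, float("nan"), 3.0],
})

# Text column: fill missing entries with the most frequent value (mode).
df["name"] = df["name"].fillna(df["name"].mode()[0])

# Numeric column: fill missing entries with the mean of the observed values.
df["reviews_per_month"] = df["reviews_per_month"].fillna(df["reviews_per_month"].mean())
```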

Date Parsing

As a precaution, I converted the dates in the last_review column into a common format.
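
One way to do this in pandas is with pd.to_datetime, as in the sketch below. The sample values and the year-month-day input format are assumptions; errors="coerce" turns anything unparseable into NaT instead of raising an error.

```python
import pandas as pd

# Made-up values; the last one is deliberately not a valid date.
df = pd.DataFrame({"last_review": ["2019-05-21", "2018-10-19", "not a date"]})

# Parse the strings into datetime values; invalid entries become NaT.
df["last_review"] = pd.to_datetime(df["last_review"], format="%Y-%m-%d", errors="coerce")
```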

Outlier Detection and Handling

Here are a few points I noted in this section:

Outliers can cause problems in model building, so I decided to remove them from the dataset.
I used the z-score technique to detect outliers in the latitude, longitude and price columns, as they are the most crucial columns for the analysis.
A data point whose z-score lies between -3 and +3 is not an outlier; otherwise it is classified as an outlier.
I created the z_latitude and z_longitude columns to store the z-scores of the latitude and longitude columns respectively.
I removed the rows where z_latitude > 3 or z_latitude < -3.
There were no rows with z_longitude < -3.
I removed the rows where z_longitude > 3.
I created the z_price column to store the z-scores of the price column.
Since most of the outliers detected in the price column have names containing words such as "luxury" and "trendy", which may explain their high price, I kept the price outliers as they are.
There were no rows with z_price < -3.
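
The z-score of a value is its distance from the column mean measured in standard deviations, z = (x - mean) / std. A minimal sketch of the filtering on made-up latitudes is shown below. It computes the z-score by hand with pandas, whose std() uses the sample standard deviation by default; the notebook may have used a library helper with a different convention.

```python
import pandas as pd

# Made-up latitudes: 19 typical values and one far-away point.
df = pd.DataFrame({"latitude": [40.70] * 10 + [40.72] * 9 + [41.50]})

# z-score: distance from the mean in units of standard deviation.
df["z_latitude"] = (df["latitude"] - df["latitude"].mean()) / df["latitude"].std()

# Keep only the rows whose z-score lies between -3 and +3.
cleaned = df[df["z_latitude"].abs() <= 3]
```

Here only the 41.50 row is dropped, since it is the only point more than three standard deviations from the mean.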

Text preprocessing

In this section, I cleaned the text based columns such as name, host_name, neighbourhood_group and neighbourhood in the following manner:

I removed stopwords from the columns to concentrate on the more important words.
I also included a punctuation removal function.
I converted all these textual columns into lower case for easy handling.
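
These three operations can be combined into one helper, sketched below using only the standard library and pandas. The clean_text function and the tiny STOPWORDS set are hypothetical stand-ins; a real stopword list from an NLP library would be much longer. Lowercasing is done first here so that the stopword match is case-insensitive.

```python
import string

import pandas as pd

# Small hypothetical stopword list used only for this illustration.
STOPWORDS = {"a", "an", "the", "in", "of", "and", "to", "with"}


def clean_text(text: str) -> str:
    """Lowercase the text, strip punctuation and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [word for word in text.split() if word not in STOPWORDS]
    return " ".join(words)


# Made-up listing names.
df = pd.DataFrame({"name": ["Cozy Room in the Heart of Brooklyn!", "Sunny, Spacious Loft"]})
df["name"] = df["name"].apply(clean_text)
```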

Column Removal

In this section, I removed the id, host_name, z_latitude, z_longitude and z_price columns, as they don't play a major role in the remaining cleaning steps and contribute little to the visualization process.

Label Encoding

In this section, I focused on converting many of the text based columns into numeric type using label encoding, which makes them easier to handle.
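
Label encoding assigns an integer code to each distinct text value. The sketch below does this with pandas' categorical type rather than any particular encoder class, so it illustrates the idea and is not necessarily the method used in the notebook. The sample values and the room_type_encoded column name are made up.

```python
import pandas as pd

# Made-up, already cleaned room types.
df = pd.DataFrame({"room_type": ["private room", "entire home", "private room", "shared room"]})

# Each distinct label gets an integer code; codes follow the sorted label order.
df["room_type_encoded"] = df["room_type"].astype("category").cat.codes
```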

Min Max Scaling

In the final section, I performed min max scaling on the numeric columns to prevent any single column from dominating a machine learning algorithm.

I first divided the dataset into scaling and non-scaling features.
Then I performed min max scaling on the scaling features and stored the result in scaled features.
I created new data frames from the scaled features and the non-scaling features, then concatenated them into the final cleaned dataset 'df'.
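
Min max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The split, scale and concatenate flow described above can be sketched in plain pandas as follows. The rows are made up and the choice of which columns to scale is an assumption; the notebook may have used a scaler class from a machine learning library instead.

```python
import pandas as pd

# Made-up rows; room_type is assumed to be already label encoded.
df = pd.DataFrame({
    "price": [50.0, 100.0, 150.0],
    "latitude": [40.5, 40.75, 41.0],
    "room_type": [0, 1, 2],
})

# Hypothetical split into columns to scale and columns to leave untouched.
scaling_features = ["price", "latitude"]
non_scaling_features = ["room_type"]

# Min max scaling: (x - min) / (max - min), applied column by column.
subset = df[scaling_features]
scaled_features = (subset - subset.min()) / (subset.max() - subset.min())

# Concatenate the scaled and unscaled parts into the final cleaned dataset.
df = pd.concat([scaled_features, df[non_scaling_features]], axis=1)
```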

LinkedIn : https://www.linkedin.com/posts/amruha-ahmed_data-oasisinfobyte-intern-activity-7194780743552679936-T5Am?utm_source=share&utm_medium=member_desktop

Analyzing Google Play Store Data

In this project done as a part of my Data Analytics Internship at Oasis Infobyte, I have performed an in-depth analysis of Google Play Store data. This analysis aims to understand the factors that contribute to an app's success, such as user ratings, number of installs, and category trends. By uncovering these insights, developers and businesses can optimize their app offerings.

Datasets used

Link to the datasets: https://www.kaggle.com/datasets/utshabkumarghosh/android-app-market-on-google-play

For this analysis, I used two datasets from Kaggle covering over 9000 apps from 20+ categories, along with reviews of those apps. The datasets are:

apps.csv: It contains 14 columns pertaining to the apps listed on Google Play Store
user_reviews.csv: It contains 5 columns pertaining to the reviews received by each app and the corresponding sentiment.

Columns in the Datasets

The columns present in apps.csv are as follows:

Unnamed: 0
App
Category
Rating
Reviews
Size
Installs
Type
Price
Content Rating
Genres
Last Updated
Current Ver
Android Ver

The columns present in user_reviews.csv are as follows:

App
Translated_Review
Sentiment
Sentiment_Polarity
Sentiment_Subjectivity

Project Structure

I have followed the steps given below to analyze this data:

Initial Data Exploration
Data Cleaning
Feature Engineering
Descriptive Statistics
Data Visualization
Data Analysis

Initial Data Exploration

I loaded the two datasets under consideration and stored them in pandas dataframes. I also imported the necessary libraries.

Data Cleaning

In this section, I implemented the following steps:

Handled missing values in critical columns like Rating and Reviews.
Converted data types for columns like Installs and Price.
Removed punctuation from text based columns such as Category and Genres.
Removed stopwords from text based columns such as Translated_Review.
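
The type conversion and missing value steps might look like the sketch below. It assumes that Installs is stored as text such as "10,000+" and Price as text such as "$4.99", and it fills missing ratings with the column mean; both the raw formats and the fill strategy are assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up rows in the assumed raw text format.
apps = pd.DataFrame({
    "Installs": ["10,000+", "500+", "1,000,000+"],
    "Price": ["0", "$4.99", "$0.99"],
    "Rating": [4.1, float("nan"), 4.5],
})

# Strip thousands separators and the trailing plus sign, then convert to integers.
apps["Installs"] = apps["Installs"].str.replace(r"[,+]", "", regex=True).astype(int)

# Strip the currency symbol, then convert to floats.
apps["Price"] = apps["Price"].str.replace("$", "", regex=False).astype(float)

# Fill missing ratings with the mean of the observed ratings.
apps["Rating"] = apps["Rating"].fillna(apps["Rating"].mean())
```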

Feature Engineering

In this section, I created new columns in both dataframes, such as opinion type, average price of apps per category, average number of reviews per category and so on.
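
Per-category averages like these can be attached to each row with groupby and transform, as in the sketch below. The rows are made up and the new column names are hypothetical; the opinion type column is not shown because its definition is not given here.

```python
import pandas as pd

# Made-up rows with already numeric Price and Reviews columns.
apps = pd.DataFrame({
    "Category": ["FINANCE", "FINANCE", "GAME"],
    "Price": [10.0, 0.0, 2.0],
    "Reviews": [100, 300, 50],
})

# Average per category, broadcast back onto every app in that category.
apps["avg_price_per_category"] = apps.groupby("Category")["Price"].transform("mean")
apps["avg_reviews_per_category"] = apps.groupby("Category")["Reviews"].transform("mean")
```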

Descriptive Statistics

This step involved deriving KPIs from the crucial columns using measures of central tendency. Here are a few important bits of information derived from this step:

Most of the apps belong to the "free" category
The average rating of apps is 4.17
The average size of each app listed is 20.41 MB
Most of the reviews are positive

Data Visualization

In this step, I used several types of graphs, built with the matplotlib, seaborn and wordcloud libraries, to explore and understand the data.

A few of the most pivotal graphs I made at this stage were:

Distribution of ratings
Count of Categories of apps
Content Ratings of apps

Data Analysis

Here are the top 15 insights I have gathered from the visualizations:

The majority of the apps belong to the "family" category, followed by games and tools.
The average rating of apps is 4.17.
The average size of each app listed is 20.41 MB.
Finance apps, followed by lifestyle apps, have the highest average prices among all categories.
The art and design and education categories have the highest average ratings among all categories.
Gaming apps, followed by family apps, have the highest average size per category.
The communication category, followed by the video players category, has the highest average number of installations among all categories.
The majority of the apps listed are free to use.
Free apps occupy slightly more size on average than paid apps.
The average number of installations for free apps is far greater than that for paid apps.
Most apps have an "everyone" (suitable for all ages) content rating.
Tools, education and entertainment are the most commonly used words to describe the genre of apps.
Most of the reviews evoke a positive sentiment.
Most of the reviews are subjective in nature.
Good, game and great are some of the most commonly used words in the reviews of apps.

LinkedIn : https://www.linkedin.com/posts/amruha-ahmed_oasisinfobyte-intern-internship-activity-7194788290175115264-cego?utm_source=share&utm_medium=member_desktop
