# Google Play Store Data Analysis (EDA using Python)

**Project:** Exploratory Data Analysis of Google Play Store apps dataset using Python, Pandas, NumPy, and Matplotlib.

Place the dataset file `googleplaystore.csv` in the same folder as this notebook before running. The notebook is structured and commented so recruiters or interviewers can follow your process.

---

## 1. Libraries & Settings
Import required libraries and basic settings.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Matplotlib default settings (default colors are kept)
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['axes.grid'] = True

print('Libraries imported')

## 2. Load dataset
The notebook expects a `googleplaystore.csv` file. If you have a differently named file, change the path below.

In [None]:
DATA_PATH = 'googleplaystore.csv'

if not os.path.exists(DATA_PATH):
    print(f"Dataset not found at {DATA_PATH}. Please upload the CSV into the notebook folder and re-run this cell.")
else:
    df = pd.read_csv(DATA_PATH)
    print('Loaded dataset with shape:', df.shape)
    display(df.head())

## 3. Initial inspection
Check columns, dtypes, missing values and quick stats.

In [None]:
if 'df' in globals():
    df.info()  # info() prints its report and returns None, so it needs no display()
    display(df.describe(include='all').T)
    missing = df.isnull().sum().sort_values(ascending=False)
    display(missing[missing>0])
else:
    print('Load the dataset first.')
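
Raw null counts are easier to judge next to the share of rows they represent. The sketch below builds such a summary on a toy frame. `missing_summary` is a hypothetical helper that is not part of the notebook, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.1, np.nan, np.nan, 3.9],
    "Size": ["19M", "Varies with device", None, "8.7M"],
})
print(missing_summary(toy))
```

Note that a placeholder string such as `Varies with device` is not counted as missing at this stage; it only becomes NaN once the column is parsed.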

## 4. Data cleaning & transformation
Common cleaning steps used in the project. The cleaning function works on a copy, so the original `df` stays untouched and any step can be undone by re-running the cells from the top.

In [None]:
def clean_googleplay(df):
    df = df.copy()
    # Standard column name cleanup
    df.columns = [c.strip() for c in df.columns]

    # Remove duplicates based on 'App' column if exists
    if 'App' in df.columns:
        before = df.shape[0]
        df = df.drop_duplicates(subset=['App'], keep='first')
        after = df.shape[0]
        print(f'Removed {before-after} duplicate rows based on App column')

    # Clean 'Installs' column -> numeric
    if 'Installs' in df.columns:
        df['Installs'] = df['Installs'].astype(str).str.replace('[+,]', '', regex=True).str.replace('Free', '')
        df['Installs'] = pd.to_numeric(df['Installs'], errors='coerce')

    # Clean 'Price' column -> numeric
    if 'Price' in df.columns:
        df['Price'] = df['Price'].astype(str).str.replace('[$,]', '', regex=True)
        df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

    # Convert 'Reviews' to numeric
    if 'Reviews' in df.columns:
        df['Reviews'] = pd.to_numeric(df['Reviews'], errors='coerce')

    # Clean 'Size' column: handle 'M', 'k' and 'Varies with device'
    if 'Size' in df.columns:
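        def parse_size(x):
            """Convert sizes such as '19M' or '201k' to bytes; NaN if unparseable."""
            x = str(x).strip()
            if x in ['Varies with device', 'nan', 'None']:
                return np.nan
            # Suffix multipliers: 'M' -> 1e6, 'k' -> 1e3
            multipliers = {'M': 1e6, 'k': 1e3}
            try:
                if x[-1:] in multipliers:
                    return float(x[:-1]) * multipliers[x[-1]]
                return float(x)
            except ValueError:
                # Catch only parsing failures instead of using a bare except
                return np.nan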
        df['Size_bytes'] = df['Size'].apply(parse_size)

    # Ratings to numeric
    if 'Rating' in df.columns:
        df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

    # Last Updated -> datetime
    if 'Last Updated' in df.columns:
        df['Last Updated'] = pd.to_datetime(df['Last Updated'], errors='coerce')

    return df

# Apply cleaning if dataset loaded
if 'df' in globals():
    df_clean = clean_googleplay(df)
    print('Cleaning complete. Clean shape:', df_clean.shape)
    display(df_clean.head())
else:
    print('Load the dataset first.')
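
To see what the string cleanup in `clean_googleplay` does, it helps to run the same two patterns on a handful of values. This is a minimal sketch on made-up strings, assuming the raw columns look like `10,000+` and `$4.99`; the misplaced values `Free` and `Everyone` stand in for any non-numeric entry.

```python
import pandas as pd

# Toy values mimicking the raw string formats (illustrative only)
raw = pd.DataFrame({
    "Installs": ["10,000+", "5,000,000+", "Free"],
    "Price": ["0", "$4.99", "Everyone"],
})

# Strip '+' and ',' then coerce; anything non-numeric becomes NaN
installs = pd.to_numeric(
    raw["Installs"].str.replace("[+,]", "", regex=True), errors="coerce"
)

# Strip '$' and ',' then coerce in the same way
price = pd.to_numeric(
    raw["Price"].str.replace("[$,]", "", regex=True), errors="coerce"
)

print(installs.tolist())
print(price.tolist())
```

Inside a character class such as `[+,]` or `[$,]`, the `+` and `$` characters are matched literally, so no escaping is needed. With `errors='coerce'`, bad rows surface as NaN and can be counted afterwards instead of raising an exception mid-cleaning.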

### 4.1 Handling missing values
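Examples: drop rows with no App name, impute missing Rating values with the median (stored in a new `Rating_imputed` column so the original `Rating` stays intact), and fill missing Installs with 0.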

In [None]:
if 'df_clean' in globals():
    df2 = df_clean.copy()
    # Drop rows with missing App names
    if 'App' in df2.columns:
        df2 = df2[~df2['App'].isnull()]

    # Impute Rating with median
    if 'Rating' in df2.columns:
        median_rating = df2['Rating'].median()
        df2['Rating_imputed'] = df2['Rating'].fillna(median_rating)
        print('Imputed Rating nulls with median:', median_rating)

    # Fill installs nulls with 0 (or keep as NaN based on your preference)
    if 'Installs' in df2.columns:
        df2['Installs'] = df2['Installs'].fillna(0)

    display(df2.isnull().sum().sort_values(ascending=False).head(10))
else:
    print('Run cleaning first.')
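
Median imputation hides which rows were originally empty. One option, sketched below on toy data, is to record a flag before filling so that imputed rows can be excluded from later comparisons. The `Rating_was_missing` column is a hypothetical addition and does not exist in the notebook.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Rating": [4.5, np.nan, 3.0, 4.0, np.nan],
})

# median() skips NaN by default
median_rating = toy["Rating"].median()

# Hypothetical flag column: remember which ratings were filled in
toy["Rating_was_missing"] = toy["Rating"].isna()

# Same imputation as in the notebook, written to a separate column
toy["Rating_imputed"] = toy["Rating"].fillna(median_rating)

print(toy)
```

Keeping the flag makes it possible to draw a histogram or boxplot on observed ratings only, which avoids an artificial spike at the median.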

## 5. Feature engineering
Create useful features such as `is_free`, `Category_code`, `days_since_update`, and `log_installs`.

In [None]:
if 'df2' in globals():
    df3 = df2.copy()
    if 'Type' in df3.columns:
        df3['is_free'] = df3['Type'].str.lower().eq('free')
    elif 'Price' in df3.columns:
        # Some datasets have Price column only
        df3['is_free'] = df3['Price'].fillna(0) == 0

    if 'Installs' in df3.columns:
        # Add log installs to reduce skew (log1p maps zero installs to 0)
        df3['log_installs'] = np.log1p(df3['Installs'])

    if 'Category' in df3.columns:
        df3['Category'] = df3['Category'].astype(str)
        df3['Category_code'] = df3['Category'].astype('category').cat.codes

    if 'Last Updated' in df3.columns:
        df3['days_since_update'] = (pd.Timestamp.today() - df3['Last Updated']).dt.days

    display(df3.head())
else:
    print('Run previous steps first.')
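
Two of these features are worth a quick worked example. `log1p(x)` computes `log(1 + x)`, so apps with zero installs map to 0 rather than negative infinity. `days_since_update` depends on the day the notebook is run, so the sketch below uses a fixed reference date instead. `REFERENCE_DATE` is an assumed value chosen for illustration, not something the notebook defines.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Installs": [0, 9, 999_999],
    "Last Updated": pd.to_datetime(["2018-08-01", "2018-07-02", "2017-08-01"]),
})

# log1p(x) = log(1 + x): defined at zero and compresses large counts
toy["log_installs"] = np.log1p(toy["Installs"])

# Assumed fixed reference date so the feature is identical on every run
REFERENCE_DATE = pd.Timestamp("2018-08-01")
toy["days_since_update"] = (REFERENCE_DATE - toy["Last Updated"]).dt.days

print(toy)
```

Pinning the reference date (for example to the most recent `Last Updated` value in the data) keeps results comparable between runs, at the cost of no longer measuring age relative to today.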

## 6. Exploratory Data Analysis (EDA)
Several visualizations: distributions, bar plots, scatter plots and boxplots.
Each plot uses matplotlib only and is independent (no seaborn).

In [None]:
if 'df3' in globals():
    # Rating distribution
    if 'Rating_imputed' in df3.columns:
        plt.figure()
        plt.hist(df3['Rating_imputed'].dropna(), bins=20)
        plt.title('Distribution of App Ratings (imputed)')
        plt.xlabel('Rating')
        plt.ylabel('Count')
        plt.show()

    # Top categories by number of apps
    if 'Category' in df3.columns:
        top_cats = df3['Category'].value_counts().nlargest(10)
        plt.figure()
        top_cats.plot.bar()
        plt.title('Top 10 App Categories by Count')
        plt.xlabel('Category')
        plt.ylabel('Number of Apps')
        plt.show()

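    # Installs vs Rating scatter
    if 'Installs' in df3.columns and 'Rating_imputed' in df3.columns:
        # A log axis cannot show zero, so plot only apps with at least one install
        plotted = df3[df3['Installs'] > 0]
        plt.figure()
        plt.scatter(plotted['Installs'], plotted['Rating_imputed'], alpha=0.4)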
        plt.xscale('log')
        plt.title('Installs vs Rating')
        plt.xlabel('Installs (log scale)')
        plt.ylabel('Rating')
        plt.show()

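    # Boxplot: Rating by Type (Free/Paid)
    if 'is_free' in df3.columns and 'Rating_imputed' in df3.columns:
        plt.figure()
        groups = [df3.loc[df3['is_free'] == val, 'Rating_imputed'].dropna() for val in [True, False]]
        plt.boxplot(groups)
        # Boxes sit at positions 1 and 2; labelling via xticks works across Matplotlib versions
        plt.xticks([1, 2], ['Free', 'Paid'])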
        plt.title('Rating distribution: Free vs Paid')
        plt.ylabel('Rating')
        plt.show()

else:
    print('Run feature engineering first.')
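
Since the project README is meant to include sample images, each figure can also be written to disk before it is closed. The following is a minimal sketch on toy data. `save_figure`, the `images` folder, and the file name are hypothetical choices, and a temporary directory is used so the example leaves the project folder untouched.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

def save_figure(fig, path, dpi=150):
    """Write a figure to disk, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    fig.savefig(path, dpi=dpi, bbox_inches="tight")
    return path

# Toy ratings (illustrative values only)
ratings = np.array([3.5, 4.0, 4.1, 4.3, 4.4, 4.5, 4.7])

fig, ax = plt.subplots()
ax.hist(ratings, bins=5)
ax.set_title("Distribution of App Ratings (toy data)")
ax.set_xlabel("Rating")
ax.set_ylabel("Count")

# Temporary folder for the demo; in the project this could be a local images/ folder
out_path = save_figure(fig, Path(tempfile.mkdtemp()) / "images" / "rating_hist.png")
plt.close(fig)
print(out_path.exists())
```

In the notebook itself, the equivalent step would be a `plt.savefig(...)` call placed before `plt.show()`.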

### 6.1 Correlations & Top apps
Check numeric correlations and list the top-performing apps by installs.

In [None]:
if 'df3' in globals():
    numeric = df3.select_dtypes(include=[np.number])
    if not numeric.empty:
        corr = numeric.corr()
        display(corr)

    # Top 20 apps by installs
    if 'Installs' in df3.columns and 'App' in df3.columns:
        # Show only the columns that exist in this dataset
        cols = [c for c in ['App', 'Category', 'Installs', 'Rating_imputed'] if c in df3.columns]
        top_by_installs = df3.sort_values('Installs', ascending=False).head(20)[cols]
        display(top_by_installs)
else:
    print('Run previous steps first.')
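
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association and can understate relationships in heavily skewed columns such as install counts. As a hedged alternative that this notebook does not use, a rank-based Spearman correlation can be computed with the same method. The toy values below are made up to show the difference.

```python
import pandas as pd

# Toy data: installs grow by orders of magnitude while reviews grow steadily
toy = pd.DataFrame({
    "Installs": [10, 100, 1_000, 10_000, 1_000_000],
    "Reviews": [1, 2, 3, 4, 5],
})

# Pearson measures linear association on the raw values
pearson = toy.corr(method="pearson").loc["Installs", "Reviews"]

# Spearman measures monotonic association on the ranks
spearman = toy.corr(method="spearman").loc["Installs", "Reviews"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Here the relationship is perfectly monotonic, so Spearman reports the maximum value while Pearson comes out lower because of the single very large install count. Comparing both on `df3` can show whether a weak Pearson value reflects skew rather than a genuine lack of association.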

## 7. Key insights (examples)
- Free apps dominate in count.
- Certain categories (e.g., GAME, FAMILY) often have higher installs.
- Ratings often correlate only weakly with installs: a high install count does not always mean a higher rating.

> Replace these example insights with your actual findings after running the notebook on your dataset.
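
Claims like these can be backed with a few aggregate numbers. The sketch below computes the share of free apps and the median installs per category on a toy stand-in for `df3`. The values are illustrative only and are not results from the real dataset.

```python
import pandas as pd

# Toy stand-in for df3 (illustrative values only)
toy = pd.DataFrame({
    "Category": ["GAME", "GAME", "FAMILY", "TOOLS", "TOOLS", "TOOLS"],
    "is_free": [True, True, False, True, True, False],
    "Installs": [1_000_000, 500_000, 10_000, 5_000, 1_000, 100],
})

# The mean of a boolean column is the share of True values
free_share = toy["is_free"].mean()

# Median installs per category, highest first
median_installs = (
    toy.groupby("Category")["Installs"].median().sort_values(ascending=False)
)

print(f"Free apps: {free_share:.0%}")
print(median_installs)
```

The median is used instead of the mean because a few very popular apps can dominate a category average.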

## 8. Next steps / Enhancements
- Sentiment analysis on reviews (NLP)
- Build predictive models (e.g., predict high-install apps)
- Create an interactive dashboard with Plotly Dash or Streamlit
- Deploy cleaned dataset and notebook to GitHub with README and images

---

## 9. Reproducibility
Ensure the repository contains:
- `googleplaystore.csv`
- `google_play_eda.ipynb` (this notebook)
- `README.md` with project summary and sample images
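
Recording the library versions used for a run also helps others reproduce the results. This is a minimal sketch using only the standard library. `installed_versions` is a hypothetical helper, and the package list is assumed from the imports at the top of this notebook.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version string."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = "not installed"
    return found

versions = installed_versions(["numpy", "pandas", "matplotlib"])

print(f"python=={sys.version.split()[0]}")
for name, ver in versions.items():
    print(f"{name}=={ver}")
```

The printed lines can be pasted into the README or adapted into a `requirements.txt` file.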
