**Project Idea:** Software Usage and Performance Analysis
- **Objective:** Analyze software usage data to identify patterns in user engagement and software performance issues.
- **Skills Demonstrated:** User behavior analysis, anomaly detection
- **Dataset Link:** [Kaggle - Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps)

#### About Dataset

#### Context
While many public datasets (on Kaggle and the like) provide Apple App Store data, few counterpart datasets are available for Google Play Store apps anywhere on the web. On digging deeper, I found that the iTunes App Store page uses a nicely indexed, appendix-like structure that allows simple and easy web scraping. The Google Play Store, on the other hand, uses sophisticated modern techniques (like dynamic page loading) with jQuery, which makes scraping more challenging.

#### Content
Each app (row) has values for category, rating, size, and more.

#### Acknowledgements
This information is scraped from the Google Play Store. This app information would not be available without it.

#### Inspiration
The Play Store apps data has enormous potential to drive app-making businesses to success. Actionable insights can be drawn for developers to work on and capture the Android market!

In [2]:
import pandas as pd

In [1]:
# Load the first dataset
file_path_1 = 'googleplaystore_user_reviews.csv'
data_1 = pd.read_csv(file_path_1)

In [3]:
# Displaying the first few rows to understand the structure and content
data_1.head()

| | App | Translated_Review | Sentiment | Sentiment_Polarity | Sentiment_Subjectivity |
|---|---|---|---|---|---|
| 0 | 10 Best Foods for You | I like eat delicious food. That's I'm cooking ... | Positive | 1.0 | 0.533333 |
| 1 | 10 Best Foods for You | This help eating healthy exercise regular basis | Positive | 0.25 | 0.288462 |
| 2 | 10 Best Foods for You | NaN | NaN | NaN | NaN |
| 3 | 10 Best Foods for You | Works great especially going grocery store | Positive | 0.4 | 0.875 |
| 4 | 10 Best Foods for You | Best idea us | Positive | 1.0 | 0.3 |


The first dataset, `googleplaystore_user_reviews.csv`, contains the following columns:

1. **App**: Name of the app.
2. **Translated_Review**: User reviews translated into English.
3. **Sentiment**: Sentiment of the review (Positive, Negative, Neutral).
4. **Sentiment_Polarity**: Numeric value representing the sentiment polarity.
5. **Sentiment_Subjectivity**: Numeric value representing the sentiment subjectivity.

From the initial preview, it's clear that there are missing values (`NaN`) in the dataset. The data cleaning tasks for this dataset may include:

1. **Handling Missing Values**: Determine how to handle rows with missing data, especially in the `Translated_Review`, `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns.
2. **Consistency Checks**: Ensure consistency in categorical data, like the `Sentiment` column.
3. **Data Type Validation**: Ensure that data types for each column are appropriate (e.g., numerical values for `Sentiment_Polarity` and `Sentiment_Subjectivity`).

I'll perform these cleaning tasks on this dataset. Let's start with handling missing values.

After the initial data cleaning on the first dataset (`googleplaystore_user_reviews.csv`), we've addressed the missing values:

1. **Before Cleaning**: There were 26,868 missing values in `Translated_Review`, and a similar count in `Sentiment`, `Sentiment_Polarity`, and `Sentiment_Subjectivity` columns.
2. **Action Taken**: Rows with missing `Translated_Review` were removed, as these rows wouldn't be useful for analysis. This also resolved missing values in the other columns.
3. **After Cleaning**: There are no missing values in the dataset.
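
The missing-value step described above can be sketched as follows. This is a minimal illustration, not the exact code that was run: the small `reviews` frame is a hypothetical stand-in for `googleplaystore_user_reviews.csv`, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same columns as googleplaystore_user_reviews.csv
reviews = pd.DataFrame({
    "App": ["10 Best Foods for You"] * 3,
    "Translated_Review": ["Best idea us", np.nan,
                          "Works great especially going grocery store"],
    "Sentiment": ["Positive", np.nan, "Positive"],
    "Sentiment_Polarity": [1.0, np.nan, 0.4],
    "Sentiment_Subjectivity": [0.3, np.nan, 0.875],
})

# Count missing values per column before cleaning
missing_before = reviews.isnull().sum()

# Drop rows that have no review text; in this sample the sentiment
# columns are missing on exactly the same rows, so they are resolved too
reviews_clean = reviews.dropna(subset=["Translated_Review"]).reset_index(drop=True)

# Count missing values per column after cleaning
missing_after = reviews_clean.isnull().sum()

print(missing_before)
print(missing_after)
```

On the real file, comparing `missing_before` and `missing_after` is a quick way to confirm that dropping on `Translated_Review` alone also clears the other three columns.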

Next, I'll check for data consistency, particularly in the `Sentiment` column, and validate the data types for `Sentiment_Polarity` and `Sentiment_Subjectivity`. Let's proceed with these checks.

The consistency and data type checks for the cleaned first dataset (`googleplaystore_user_reviews.csv`) reveal the following:

1. **Sentiment Consistency**: The `Sentiment` column contains exactly three unique values: 'Positive', 'Neutral', and 'Negative'. This indicates consistent categorization with no irregular or unexpected values.
2. **Data Types**:
   - `App` and `Translated_Review` are of type `object`, which is suitable for text data.
   - `Sentiment` is also of type `object`, appropriate for categorical data.
   - `Sentiment_Polarity` and `Sentiment_Subjectivity` are of type `float64`, which is appropriate for numerical values.
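
One way to run these checks is sketched below, assuming a cleaned frame named `reviews_clean` (a hypothetical sample here, not the real data). The `expected` set comes from the three sentiment labels listed above.

```python
import pandas as pd

# Hypothetical cleaned sample; the real frame would come from the CSV
reviews_clean = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Translated_Review": ["Great", "It is okay", "Keeps crashing"],
    "Sentiment": ["Positive", "Neutral", "Negative"],
    "Sentiment_Polarity": [0.8, 0.0, -0.5],
    "Sentiment_Subjectivity": [0.75, 0.5, 0.6],
})

# Consistency check: any label outside the expected set is suspicious
expected = {"Positive", "Neutral", "Negative"}
observed = set(reviews_clean["Sentiment"].unique())
unexpected = observed - expected

# Data type check: both score columns should be floating point
score_columns = ["Sentiment_Polarity", "Sentiment_Subjectivity"]
scores_are_float = all(
    pd.api.types.is_float_dtype(reviews_clean[col]) for col in score_columns
)

print("Unexpected sentiment labels:", unexpected)
print(reviews_clean.dtypes)
```

An empty `unexpected` set and `scores_are_float` being `True` correspond to the two findings reported above.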

The first dataset is now clean and appears ready for analysis. Let's proceed with examining and cleaning the second dataset, `googleplaystore.csv`.

The second dataset, `googleplaystore.csv`, contains the following columns:

1. **App**: Name of the app.
2. **Category**: Category of the app.
3. **Rating**: User rating of the app.
4. **Reviews**: Number of user reviews.
5. **Size**: Size of the app.
6. **Installs**: Number of installs.
7. **Type**: Type of the app (Free/Paid).
8. **Price**: Price of the app.
9. **Content Rating**: Age group the app is targeted at.
10. **Genres**: Genres of the app.
11. **Last Updated**: Date when the app was last updated.
12. **Current Ver**: Current version of the app.
13. **Android Ver**: Android version required to run the app.

For data cleaning, we will need to:

1. **Handle Missing Values**: Identify and treat missing data.
2. **Data Type Validation**: Ensure that data types for each column are appropriate (e.g., numerical values for `Rating`, `Reviews`, and `Price`).
3. **Data Format Consistency**: Some columns like `Size`, `Installs`, and `Price` may require formatting to be consistent and useful for analysis.

Let's start by handling missing values and checking data types.

In the second dataset (`googleplaystore.csv`), we have the following observations:

1. **Missing Values**:
   - `Rating`: 1,474 missing values.
   - `Type`, `Content Rating`, `Current Ver`, and `Android Ver` have a few missing values.

2. **Data Types**:
   - Most columns are of type `object`, including `Reviews`, `Size`, `Installs`, and `Price`, which might contain numerical data but are currently in text format.
   - `Rating` is of type `float64`, which is appropriate.

For data cleaning, we'll need to handle these missing values and convert certain columns to appropriate data types. We can start by addressing the missing values. Given the nature of the data, it might be appropriate to fill missing values in `Rating` with a measure like the median or mean, and drop rows with missing values in other columns that have a small number of missing entries. Let's proceed with this approach.

The missing values in the second dataset (`googleplaystore.csv`) have been addressed:

1. **Missing `Rating` Values**: Filled with the median rating.
2. **Other Missing Values**: Rows with missing values in `Type`, `Content Rating`, `Current Ver`, and `Android Ver` were removed.

After these steps, there are no missing values in the dataset.
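
A minimal sketch of this two-part approach follows. The `apps` frame is a hypothetical sample with only the relevant columns of `googleplaystore.csv`, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the columns that had missing values
apps = pd.DataFrame({
    "App": ["A", "B", "C", "D"],
    "Rating": [4.0, np.nan, 5.0, 3.0],
    "Type": ["Free", "Paid", np.nan, "Free"],
    "Content Rating": ["Everyone"] * 4,
    "Current Ver": ["1.0", "2.1", "1.0", "3.0"],
    "Android Ver": ["4.0 and up"] * 4,
})

# Fill missing ratings with the median of the observed ratings
median_rating = apps["Rating"].median()
apps["Rating"] = apps["Rating"].fillna(median_rating)

# Drop the few rows that are missing any of the remaining columns
sparse_columns = ["Type", "Content Rating", "Current Ver", "Android Ver"]
apps = apps.dropna(subset=sparse_columns).reset_index(drop=True)

print(apps.isnull().sum())
```

The median is less sensitive than the mean to a handful of extreme ratings, which is one reason to prefer it here. Note that imputing many ratings with a single value narrows the rating distribution, so later analysis of `Rating` should keep that in mind.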

Next, I will convert certain columns to more appropriate data types and handle any inconsistent formats in columns like `Reviews`, `Size`, `Installs`, and `Price`. Let's proceed with these tasks.

The data type conversions and format adjustments in the second dataset (`googleplaystore.csv`) have been completed:

1. **Reviews**: Converted from `object` to `int64`.
2. **Size**: Converted to numeric values (in MB). Non-numeric values (like 'Varies with device') are set to `None`, which pandas stores as `NaN` in a numeric column.
3. **Installs**: Converted from `object` (string with '+' and ',') to `float64`.
4. **Price**: Converted from `object` (string with '$') to `float64`.
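
The conversions above could look roughly like this. It is a sketch under assumptions: the sample values are hypothetical, `size_to_mb` is an illustrative helper rather than the function actually used, and the handling of a `k` (kilobyte) suffix is an assumption that the description above does not mention.

```python
import pandas as pd

# Hypothetical sample of the raw text columns
apps = pd.DataFrame({
    "Reviews": ["159", "87510"],
    "Size": ["19M", "Varies with device"],
    "Installs": ["10,000+", "5,000,000+"],
    "Price": ["0", "$4.99"],
})

def size_to_mb(value):
    """Return the size in MB, or None when the text is not a plain size."""
    if value.endswith("M"):
        return float(value[:-1])
    if value.endswith("k"):  # assumption: 'k' means kilobytes
        return float(value[:-1]) / 1024
    return None  # e.g. 'Varies with device'

# Reviews: plain digit strings to integers
apps["Reviews"] = apps["Reviews"].astype("int64")

# Size: None entries become NaN in the resulting float column
apps["Size"] = pd.Series(
    [size_to_mb(v) for v in apps["Size"]], index=apps.index, dtype="float64"
)

# Installs: strip '+' and ',' before converting
apps["Installs"] = (
    apps["Installs"].str.replace(r"[+,]", "", regex=True).astype("float64")
)

# Price: strip the leading '$' before converting
apps["Price"] = apps["Price"].str.replace("$", "", regex=False).astype("float64")

print(apps.dtypes)
```

On the full file, any malformed row would make `astype` raise an error; `pd.to_numeric(..., errors="coerce")` is an alternative that turns such values into `NaN` instead.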

The dataset is now clean, with appropriate data types for analysis. Both datasets are ready for further steps in your software usage and performance analysis project. If you have specific analysis tasks or questions in mind, feel free to let me know!