# Analyzing Mobile App Data

This project is about finding the kinds of apps that attract the most users.

Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.


#### Converting both datasets to lists of lists

In [2]:
#Reusable function to convert a csv file to a list-of-lists dataset
from csv import reader

def convt_to_list(csvfile):
    # The with-block closes the file once all rows have been read
    with open(csvfile, encoding='utf8') as opened_data:
        return list(reader(opened_data))

#Google Play Store data
googleplay_data = convt_to_list('googleplaystore.csv')
#Apple App Store (iOS) data
applestore_data = convt_to_list('AppleStore.csv')


Reusable function to print rows or columns in a readable way

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#Printing the number of rows and columns (the empty slice [1:0] prints no rows)
explore_data(googleplay_data, 1, 0, True)
print('\n')

#Printing the first few rows

#Google Play Store
explore_data(googleplay_data, 0, 5)
print('\n')
#Apple App Store (iOS)
explore_data(applestore_data, 0, 5)

Number of rows: 10842
Number of columns: 13


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']




['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating

## Cleaning Data

According to the discussion section on Kaggle, row 10473 has a column shift, which means that a value is missing and the remaining values have moved into the wrong columns. I want to verify this myself, so I will check every row for empty strings and also check the other rows for possible column shifts.

#### Checking for Rows With empty Strings

In [4]:
def check_for_empty_strings(dataset):
    # Initialize lists to store rows and columns with empty strings
    rows_with_empty_strings = []
    columns_with_empty_strings = []

    # Iterate through the dataset rows
    for row_index in range(len(dataset)):
        row = dataset[row_index]

        # Iterate through the cells in the current row
        for col_index in range(len(row)):
            cell = row[col_index]

            if cell == "":
                # Record the row and column indices where an empty string is found
                rows_with_empty_strings.append(row_index)
                columns_with_empty_strings.append(col_index)

    return rows_with_empty_strings, columns_with_empty_strings


# Call the function to check for empty strings
empty_rows, empty_columns = check_for_empty_strings(googleplay_data)

# Print rows and columns with empty strings
if empty_rows:
    print("Rows with empty strings:", empty_rows)
else:
    print("No empty strings found in rows")

if empty_columns:
    print("Columns with empty strings:", empty_columns)
else:
    print("No empty strings found in columns")


Rows with empty strings: [1554, 10473]
Columns with empty strings: [11, 8]


#### Checking for Column shift

In [5]:
def check_for_column_shift(dataset):
    # Check if the dataset is empty
    if not dataset:
        return "Dataset is empty"

    # Determine the expected number of columns based on the length of the header row
    expected_num_columns = len(dataset[0])

    # Initialize a list to store rows with potential column shifts
    rows_with_column_shift = []

    # Iterate through the dataset rows, excluding the header row (start from index 1)
    for row_index in range(1, len(dataset)):
        # Check if the current row has a different number of columns than expected
        if len(dataset[row_index]) != expected_num_columns:
            rows_with_column_shift.append(row_index)

    if not rows_with_column_shift:
        return "No column shifts found"

    return rows_with_column_shift

# Call the function to check for column shifts
column_shifts = check_for_column_shift(googleplay_data)

# Print rows with potential column shifts or a message if no shifts were found
if column_shifts == "Dataset is empty":
    print(column_shifts)
elif column_shifts == "No column shifts found":
    print(column_shifts)
else:
    print("Rows with potential column shifts:", column_shifts)


Rows with potential column shifts: [10473]


You can see that rows 1554 and 10473 contain empty strings, and that row 10473 is the same row that has the column shift. We delete both rows to maintain data quality, to keep the dataset's structure consistent, and to make the analysis accurate.

In [6]:
##### Deleting error Rows

In [7]:
# Delete the higher index first so that index 1554 still points at the intended row
del googleplay_data[10473]
del googleplay_data[1554]
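
The order of the two `del` statements matters, because deleting a row shifts every later row up by one. A minimal sketch of the same idea on a toy list; `delete_rows` and `rows` are illustrative names, not part of the notebook:

```python
def delete_rows(dataset, indices):
    """Delete the given row indices in place, highest index first."""
    # Deleting from the end keeps the remaining (smaller) indices valid;
    # deleting a low index first would shift every later row up by one.
    for index in sorted(set(indices), reverse=True):
        del dataset[index]
    return dataset


# Hypothetical toy data: a header plus four rows
rows = [['App'], ['a'], ['bad-1'], ['b'], ['bad-2']]
delete_rows(rows, [2, 4])
print(rows)  # [['App'], ['a'], ['b']]
```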

#### Checking for rows with duplicate apps

In [8]:
def print_duplicate_apps(dataset, app_name_column, num_duplicates_to_print=5):
    # Check if the dataset is empty
    if not dataset:
        return "Dataset is empty"

    # Initialize a set to store unique app names
    unique_app_names = set()

    # Initialize a list to store duplicate app names
    duplicate_app_names = []

    # Iterate through the dataset rows, starting from the second row (excluding header)
    for row in dataset[1:]:
        app_name = row[app_name_column]

        if app_name in unique_app_names:
            # If the app name is already in the set, it's a duplicate
            duplicate_app_names.append(app_name)
        else:
            # If the app name is not in the set, add it
            unique_app_names.add(app_name)

    if not duplicate_app_names:
        return "No duplicate apps found"

    # Print the specified number of duplicate app names
    print(f"Duplicate Apps (First {num_duplicates_to_print} Names):")
    for app_name in duplicate_app_names[:num_duplicates_to_print]:
        print(app_name)
    return "Total duplicate apps found: " + str(len(duplicate_app_names))


# Column index of the app name ("App") in the Google Play dataset
app_name_column_index = 0

# Call the function to print 5 duplicate app names
result = print_duplicate_apps(googleplay_data, app_name_column_index, num_duplicates_to_print=5)

print('\n')

# Print the result message
print(result)


Duplicate Apps (First 5 Names):
Quick PDF Scanner + OCR FREE
Box
Google My Business
ZOOM Cloud Meetings
join.me - Simple Meetings


Total duplicate apps found: 1181


I am not going to remove the duplicates randomly. The higher the number of reviews, the more recent the data should be, so for each duplicated app we'll keep the entry with the highest number of reviews.

#### Removing Duplicate apps

In [9]:
def remove_duplicates_keep_highest_reviews(dataset):
    # Create a dictionary to store apps and their highest reviews
    app_highest_reviews = {}

    # Iterate through the dataset, starting from the second row (index 1)
    for row in dataset[1:]:
        app_name = row[0]
        reviews = float(row[3])

        # Check if the app_name is already in the dictionary
        if app_name in app_highest_reviews:
            # If yes, compare review counts numerically (the CSV stores them as strings)
            if reviews > float(app_highest_reviews[app_name][3]):
                app_highest_reviews[app_name] = row
        else:
            # If no, add the app to the dictionary
            app_highest_reviews[app_name] = row

    # Convert the dictionary values (rows) to a list to create the cleaned dataset
    cleaned_dataset = [dataset[0]]  # Initialize with the header
    cleaned_dataset.extend(app_highest_reviews.values())

    return cleaned_dataset

# Call the function to remove duplicates and keep entries with the highest reviews
cleaned_dataset = remove_duplicates_keep_highest_reviews(googleplay_data)

# Print the length of the cleaned dataset (header included)
print('After removing duplicate entries with fewer reviews we are left with', len(cleaned_dataset), 'rows')

After removing duplicate entries with fewer reviews we are left with 9659 rows
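
Review counts come out of `csv.reader` as strings, so they have to be converted before they can be compared numerically. A small sketch of the difference, using hypothetical review values:

```python
# Hypothetical review counts as they would come out of csv.reader (strings)
reviews_a = '9'
reviews_b = '10'

# String comparison goes character by character, so '9' sorts after '10'
string_comparison = reviews_a > reviews_b

# Converting to numbers gives the intended ordering
numeric_comparison = int(reviews_a) > int(reviews_b)

print(string_comparison)   # True
print(numeric_comparison)  # False
```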


#### Filtering Non-English Apps from Both Datasets

In [10]:
def is_english(app_name):
    # Treat an app name as English unless it contains more than 3 non-ASCII characters
    non_english_chars = 0
    for char in app_name:
        if ord(char) > 127:
            non_english_chars += 1
        if non_english_chars > 3:
            return False
    return True

def filter_english_apps(dataset, app_name_column):
    # Check if the dataset is empty
    if not dataset:
        return "Dataset is empty"

    # Initialize a list to store rows of English apps
    english_apps = []

    # Iterate through the dataset rows, starting from the second row (excluding header)
    for row in dataset[1:]:
        app_name = row[app_name_column]
        
        if is_english(app_name):
            # If the app name is identified as English, append the whole row to the English apps list
            english_apps.append(row)

    if not english_apps:
        return "No English apps found in the dataset"

    # Create a new dataset with only the English apps (including the header row)
    new_dataset = [dataset[0]] + english_apps

    return new_dataset


# Call the function to filter out non-English apps
googleplay_dataset = filter_english_apps(cleaned_dataset, 0)
applestore_dataset = filter_english_apps(applestore_data, 1)

# Print the number of rows (header included) left in each dataset
print(len(googleplay_dataset),  'Rows Left after removing Non-English Apps from Google Play Dataset')
print('\n')
print(len(applestore_dataset),  'Rows Left after removing Non-English Apps from Apple Store Dataset')


9614 Rows Left after removing Non-English Apps from Google Play Dataset


6184 Rows Left after removing Non-English Apps from Apple Store Dataset


#### Isolating the Free Apps

In [11]:
def isolate_free_apps(dataset, price_column):
    # Check if the dataset is empty
    if not dataset:
        return "Dataset is empty"

    # Initialize a list to store rows of free apps
    free_apps = []

    # Iterate through the dataset rows, starting from the second row (excluding header)
    for row in dataset[1:]:
        price = row[price_column]
        
        # Remove dollar signs and leading/trailing spaces, and convert to lowercase for case-insensitive matching
        price = price.strip("$").strip().lower()
        
        if price == "0" or price == "0.0" or price == "free":
            # If the app is free (price is "0", "0.0", or "free"), append the whole row to the free apps list
            free_apps.append(row)

    if not free_apps:
        return "No free apps found in the dataset"

    # Create a new dataset with only the free apps (including the header row)
    new_dataset = [dataset[0]] + free_apps
    return new_dataset

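# Number of rows (header included) left after keeping only the free apps.
# Note: the filtered lists are only measured here, not stored, so the analysis
# below runs on googleplay_dataset and applestore_dataset as they are.
print('length of googleplay_dataset: ', len(isolate_free_apps(googleplay_dataset, 7)))
print('length of appleiosstore dataset: ', len(isolate_free_apps(applestore_dataset, 4)))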

length of googleplay_dataset:  8862
length of appleiosstore dataset:  3223


## Analysis

Now that we are done with the cleaning part, we'll move on to the analysis. Our company's main goal is to generate revenue by attracting more customers. To achieve this, we are going to use the following validation strategy:

- Make a minimal Android version of the app, and add it to Google Play.
- If the app attracts enough users, we develop it further.
- After 6 months, if the app is profitable, we build an iOS version of the app and add it to the App Store.

We are going to use the App Store 'prime_genre' column and the Google Play 'Genres' and 'Category' columns. We'll count the apps in each genre to determine the most common app genres.

#### Generating a Frequency Table for the Two Markets (Google Play Store and iOS App Store)

In [12]:
def build_frequency_table(dataset, column_index, include_percentages=False):
    # Create an empty dictionary to store frequency counts
    frequency_table = {}

    # Total number of rows, used as the percentage denominator
    # (len(dataset) also counts the header row)
    total_data_points = len(dataset)

    # Iterate through the dataset
    for data_point in dataset[1:]:
        # Get the value from the specified column
        column_value = data_point[column_index]

        # Check if the value is already in the dictionary, if not, initialize it to 0
        if column_value not in frequency_table:
            frequency_table[column_value] = 0

        # Increment the frequency count for the value
        frequency_table[column_value] += 1

    if include_percentages:
        # Calculate percentages and create a new dictionary
        percentage_table = {}
        for value, count in frequency_table.items():
            percentage = (count / total_data_points) * 100
            percentage_table[value] = percentage

        # Sort the dictionary by percentage in descending order
        sorted_percentage_table = dict(sorted(percentage_table.items(), key=lambda item: item[1], reverse=True))
        
        # Print the percentage table in descending order
        for genre, percentage in sorted_percentage_table.items():
            print(f"{genre}: {percentage:.2f}%")
    else:
        # Return the ordinary frequency table
        return frequency_table
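
The same kind of table can be built more compactly with `collections.Counter`. The sketch below is an alternative, not the function used in this notebook; `percentage_table` and the toy dataset are hypothetical, and it assumes the header row should be excluded from the total:

```python
from collections import Counter


def percentage_table(dataset, column_index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    rows = dataset[1:]  # skip the header row
    counts = Counter(row[column_index] for row in rows)
    total = len(rows)
    return [(value, count / total * 100) for value, count in counts.most_common()]


# Hypothetical toy dataset: a header plus four apps
toy = [['App', 'Genre'], ['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
for genre, percentage in percentage_table(toy, 1):
    print(f"{genre}: {percentage:.2f}%")
# Games: 75.00%
# Tools: 25.00%
```

Returning the sorted pairs instead of printing them inside the function leaves the caller free to print, slice, or reuse the result.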

###### Google Play Store frequency table by % for the Genres column

In [13]:
#Google playStore Genre column
build_frequency_table(googleplay_dataset, 9, True)

Tools: 8.59%
Entertainment: 5.79%
Education: 5.23%
Business: 4.36%
Medical: 4.11%
Personalization: 3.90%
Productivity: 3.88%
Lifestyle: 3.78%
Finance: 3.59%
Sports: 3.44%
Communication: 3.27%
Action: 3.11%
Health & Fitness: 3.00%
Photography: 2.91%
News & Magazines: 2.60%
Social: 2.49%
Books & Reference: 2.27%
Travel & Local: 2.27%
Shopping: 2.09%
Simulation: 1.98%
Arcade: 1.91%
Dating: 1.78%
Casual: 1.71%
Video Players & Editors: 1.67%
Maps & Navigation: 1.34%
Puzzle: 1.24%
Food & Drink: 1.16%
Role Playing: 1.08%
Strategy: 0.98%
Racing: 0.95%
Auto & Vehicles: 0.87%
Libraries & Demo: 0.86%
Weather: 0.82%
House & Home: 0.76%
Adventure: 0.75%
Events: 0.67%
Art & Design: 0.58%
Comics: 0.56%
Beauty: 0.55%
Card: 0.48%
Parenting: 0.48%
Board: 0.43%
Casino: 0.41%
Educational;Education: 0.40%
Trivia: 0.38%
Educational: 0.38%
Education;Education: 0.36%
Casual;Pretend Play: 0.26%
Word: 0.24%
Music: 0.20%
Puzzle;Brain Games: 0.19%
Education;Pretend Play: 0.18%
Racing;Action & Adventure: 0.17%
Ent

###### Google Play Store frequency table by % for the Category column

In [14]:
#Google Play Store Category column
build_frequency_table(googleplay_dataset, 1, True)

FAMILY: 19.35%
GAME: 9.79%
TOOLS: 8.60%
BUSINESS: 4.36%
MEDICAL: 4.11%
PERSONALIZATION: 3.90%
PRODUCTIVITY: 3.88%
LIFESTYLE: 3.79%
FINANCE: 3.59%
SPORTS: 3.38%
COMMUNICATION: 3.27%
HEALTH_AND_FITNESS: 3.00%
PHOTOGRAPHY: 2.91%
NEWS_AND_MAGAZINES: 2.60%
SOCIAL: 2.49%
TRAVEL_AND_LOCAL: 2.28%
BOOKS_AND_REFERENCE: 2.27%
SHOPPING: 2.09%
DATING: 1.78%
VIDEO_PLAYERS: 1.70%
MAPS_AND_NAVIGATION: 1.34%
FOOD_AND_DRINK: 1.16%
EDUCATION: 1.11%
ENTERTAINMENT: 0.90%
AUTO_AND_VEHICLES: 0.87%
LIBRARIES_AND_DEMO: 0.86%
WEATHER: 0.82%
HOUSE_AND_HOME: 0.76%
EVENTS: 0.67%
ART_AND_DESIGN: 0.62%
PARENTING: 0.62%
COMICS: 0.57%
BEAUTY: 0.55%


###### Apple iOS Store frequency table by % for the prime_genre column

In [15]:
#Apple iOS prime_genre column
build_frequency_table(applestore_dataset, 11, True)

Games: 54.85%
Entertainment: 7.26%
Education: 6.63%
Photo & Video: 5.51%
Utilities: 3.44%
Productivity: 2.72%
Health & Fitness: 2.67%
Music: 2.22%
Social Networking: 2.04%
Sports: 1.68%
Lifestyle: 1.60%
Shopping: 1.37%
Weather: 1.12%
Travel: 0.97%
News: 0.92%
Book: 0.89%
Reference: 0.86%
Business: 0.86%
Finance: 0.79%
Food & Drink: 0.71%
Navigation: 0.45%
Medical: 0.34%
Catalogs: 0.08%


From these tables, we can see that the Games genre accounts for the largest share of apps on the Apple iOS store (54.85%), so that market is dominated by fun apps, while the Category and Genres columns of the Google Play Store show a more balanced landscape of both fun and practical apps.

This tells us that iOS developers have mostly published fun apps in the Games genre, whereas Android developers have published a mix of practical and fun apps, with Tools and Entertainment among the most common genres.

However, the number of apps in a genre does not tell us how many users those apps have. Finding the genres with the most users (installs) will show us the kinds of apps that attract the most users.

Let's find out, because we are building these free apps for users.

#### Most Popular Apps by Genre on the App Store

In [18]:
# The App Store data has no install counts, so the average number of
# user ratings per app (column 5) is used as a proxy for popularity
freq_reviews = build_frequency_table(applestore_dataset, 11, False)

for genre in freq_reviews:
    reviews = 0
    no_of_apps = 0
    for app in applestore_dataset[1:]:  # skip the header row
        if app[11] == genre:
            reviews += float(app[5])
            no_of_apps += 1
    freq_reviews[genre] = reviews / no_of_apps
    print(genre, ':', freq_reviews[genre])

Social Networking : 60253.84920634921
Photo & Video : 14688.715542521993
Games : 15586.759433962265
Music : 29047.109489051094
Reference : 27037.188679245282
Health & Fitness : 10802.157575757576
Weather : 23145.246376811596
Utilities : 7927.525821596244
Travel : 19030.183333333334
Shopping : 26635.011764705883
News : 16980.315789473683
Navigation : 19370.821428571428
Lifestyle : 8930.373737373737
Entertainment : 8862.409799554565
Food & Drink : 19934.386363636364
Sports : 15350.913461538461
Book : 10359.2
Finance : 23353.530612244896
Education : 2472.278048780488
Productivity : 8508.089285714286
Business : 5149.320754716981
Catalogs : 3465.0
Medical : 648.952380952381
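
The nested loop above scans the whole dataset once per genre. A sketch of an alternative that collects the totals in a single pass and then ranks the genres; `average_by_group`, `top_groups`, and the toy rows are hypothetical names and data used only for illustration:

```python
def average_by_group(dataset, group_column, value_column):
    """Average a numeric column per group in a single pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset[1:]:  # skip the header row
        group = row[group_column]
        totals[group] = totals.get(group, 0.0) + float(row[value_column])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}


def top_groups(averages, n=3):
    """Return the n groups with the highest average, largest first."""
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical toy rows: [name, genre, rating count]
toy = [
    ['name', 'genre', 'rating_count'],
    ['a', 'Music', '100'],
    ['b', 'Music', '300'],
    ['c', 'Games', '50'],
    ['d', 'Weather', '120'],
]
averages = average_by_group(toy, 1, 2)
print(top_groups(averages, 2))  # [('Music', 200.0), ('Weather', 120.0)]
```

Sorting the averages this way makes it easier to read off the top genres than scanning the unsorted printout by eye.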


#### Most Popular Apps by Category on the Google Play Store

In [17]:
# Average number of installs per app for each Google Play category
freq_installs = build_frequency_table(googleplay_dataset, 1, False)

for category in freq_installs:
    no_of_installs = 0
    no_of_apps = 0
    for app in googleplay_dataset[1:]:  # skip the header row
        if app[1] == category:
            # Strip ',' and '+' so that e.g. '10,000+' becomes 10000.0
            installs = app[5].replace(',', '').replace('+', '')
            no_of_installs += float(installs)
            no_of_apps += 1
    freq_installs[category] = no_of_installs / no_of_apps
    print(category, ':', freq_installs[category])
            

ART_AND_DESIGN : 1887285.0
FAMILY : 3344163.6580645163
AUTO_AND_VEHICLES : 632501.3214285715
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 7641777.871559633
BUSINESS : 1663758.627684964
COMICS : 817657.2727272727
COMMUNICATION : 35153714.17515924
DATING : 824129.2807017544
EDUCATION : 1770579.4392523365
ENTERTAINMENT : 11375402.298850575
EVENTS : 249580.640625
FINANCE : 1319851.4028985507
FOOD_AND_DRINK : 1891060.2767857143
HEALTH_AND_FITNESS : 3972300.388888889
HOUSE_AND_HOME : 1331540.5616438356
TOOLS : 9676869.30471584
LIBRARIES_AND_DEMO : 626456.7469879518
LIFESTYLE : 1369954.7774725275
GAME : 14227278.868225291
VIDEO_PLAYERS : 24121489.079754602
MEDICAL : 96691.58734177215
SOCIAL : 22961790.384937238
SHOPPING : 6966908.880597015
PHOTOGRAPHY : 16604098.410714285
SPORTS : 3373767.6861538463
TRAVEL_AND_LOCAL : 13218662.767123288
PERSONALIZATION : 4086652.4853333333
PRODUCTIVITY : 15530942.008042896
PARENTING : 525351.8333333334
WEATHER : 4570892.658227848
NEWS_AND_MAGAZINES : 947
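
The install counts in the Google Play data are open-ended buckets such as '10,000+', so they have to be cleaned before they can be averaged, and the resulting averages are best read as rough lower bounds. A minimal sketch of the conversion; `parse_installs` is a hypothetical helper name:

```python
def parse_installs(installs):
    """Convert a Google Play 'Installs' string such as '10,000+' to a float."""
    return float(installs.replace(',', '').replace('+', ''))


print(parse_installs('10,000+'))     # 10000.0
print(parse_installs('5,000,000+'))  # 5000000.0
print(parse_installs('0'))           # 0.0
```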

# FINAL RESULT

## The Most Profitable App to Build on Both Platforms

#### In the Google Play Category

Communication: This category has the highest average number of installs, approximately 35,153,714 per app.

Video Players: The video players category has the second-highest average, around 24,121,489 installs per app.

Social: Social apps have the third-highest average, approximately 22,961,790 installs per app.

#### In the Apple iOS Store Genre
Social Networking: This genre has the highest average number of user ratings per app, approximately 60,253.85.

Music: Music apps have the second-highest average number of user ratings, approximately 29,047.11.

Reference: Reference apps have the third-highest average, approximately 27,037.19, ahead of Shopping (26,635.01), Finance (23,353.53) and Weather (23,145.25).



We could build a social networking app like Facebook, Instagram or Twitter. Although the app itself would be free, building it would be a complex and resource-intensive endeavor. It requires careful planning, a skilled development team, and ongoing maintenance and updates to stay competitive and secure in the evolving tech landscape.

I recommend we build a music app that incorporates features such as a video player, so that people can watch and comment on a music video directly in the app rather than going to YouTube. We can also add features like lyric display and the ability to send the artist a message about the song directly from the app.

For ads, we can run audio ads like those on Spotify, which play while someone is listening to a song, and video ads like those on YouTube, all inside the app.

We could call the music app sounvid (sounds and videos).