Data Analysis Project
Looking at some sample data

In [21]:
from csv import reader

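# Read the file inside a context manager so the handle is closed afterwards.
# encoding='utf8' is set explicitly because app names contain non-ASCII characters.
with open('googleplaystore.csv', encoding='utf8') as open_data:
    read_data = reader(open_data)
    google_store = list(read_data)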

google_header = google_store[0]

The function below prints a slice of the dataset to give a quick snapshot of it and, optionally, reports the number of rows and columns.

In [5]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # prints two blank lines after each row ('\n' plus print's own newline)

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [19]:
explore_data(google_store,0,5, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


The cell below checks for duplicate entries, which turned out to be prevalent in the dataset.

In [6]:
unique_entries = []
duplicate_entries = []

for entries in google_store:
    name = entries[0]
    if name in unique_entries:
        duplicate_entries.append(name)
    else:
        # only record a name the first time it is seen, so this list stays unique
        unique_entries.append(name)

len(duplicate_entries)

1181
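
As a cross-check on the loop above, the same count can be sketched with `collections.Counter`. The rows below are hypothetical stand-ins for `google_store[1:]`, and the names `sample_rows`, `name_counts`, `n_duplicates` and `repeated_names` are assumptions made for this illustration only.

```python
from collections import Counter

# Hypothetical sample rows (app name first), standing in for google_store[1:]
sample_rows = [
    ['App A', 'SOCIAL'],
    ['App B', 'BUSINESS'],
    ['App A', 'SOCIAL'],
    ['App A', 'SOCIAL'],
    ['App B', 'BUSINESS'],
    ['App C', 'BUSINESS'],
]

# Count how many times each app name appears
name_counts = Counter(row[0] for row in sample_rows)

# Every occurrence beyond the first counts as a duplicate entry
n_duplicates = sum(count - 1 for count in name_counts.values())

# Names that appear more than once
repeated_names = sorted(name for name, count in name_counts.items() if count > 1)

print(n_duplicates)
print(repeated_names)
```

Counting with a dictionary-like structure avoids the repeated list lookups of the loop version, which may matter on larger files.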

The two cells below clean the data. Rather than deleting duplicates at random, I kept, for each app, the entry with the highest number of reviews. The result is a final clean list named 'android_clean', with the header row removed.

In [7]:
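reviews_max = {}

for app in google_store[1:]:  # skip the header row
    name = app[0]
    # NOTE: per the header row, index 2 is 'Rating' and 'Reviews' is index 3.
    # Index 2 is what this cell reads, and it produced the outputs shown in this notebook.
    n_reviews = float(app[2])
    # Record the largest value seen so far for each app name
    if name not in reviews_max or reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews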

In [8]:
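android_clean = []
already_added = set()  # a set keeps the membership test fast

for app in google_store[1:]:
    name = app[0]
    n_reviews = float(app[2])  # same column index as in the previous cell
    # Keep one row per app: the first row whose value matches the recorded maximum
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.add(name)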
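
The two passes above could also be folded into one. The sketch below is only an illustration on hypothetical rows: it assumes the review count sits at index 3 (the 'Reviews' column in the header row), and `REVIEWS_INDEX`, `sample_rows`, `best_rows` and `deduplicated` are names made up for this example.

```python
# Hypothetical rows in the form [name, category, rating, reviews]
sample_rows = [
    ['App A', 'SOCIAL', '4.5', '100'],
    ['App A', 'SOCIAL', '4.5', '250'],
    ['App B', 'BUSINESS', '4.2', '40'],
    ['App A', 'SOCIAL', '4.5', '90'],
]

REVIEWS_INDEX = 3  # assumption: the 'Reviews' column, per the header row

best_rows = {}
for row in sample_rows:
    name = row[0]
    n_reviews = float(row[REVIEWS_INDEX])
    # Keep the row if it is the first for this name or has more reviews than the stored one
    if name not in best_rows or n_reviews > float(best_rows[name][REVIEWS_INDEX]):
        best_rows[name] = row

# Dictionaries preserve insertion order, so apps stay in first-seen order
deduplicated = list(best_rows.values())

for row in deduplicated:
    print(row)
```

Storing the whole row in the dictionary removes the need for a second loop and a separate list of names already added.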

I also removed the non-English entries and the paid apps, since this analysis focuses on free apps. An app name is treated as English if it contains at most three characters outside the ASCII range.

In [9]:
def english_test(string):
    # Count characters outside the ASCII range (code points 0-127)
    restricted_char = 0
    for characters in string:
        if ord(characters) > 127:
            restricted_char += 1
    # Allow up to three such characters (e.g. an emoji or a trademark sign)
    return restricted_char <= 3

In [10]:
android_english = []

for entries in android_clean:
    if english_test(entries[0]):
        android_english.append(entries)

In [11]:
android_free_final = []

for entries in android_english:
    if entries[7] == '0':
        android_free_final.append(entries)
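
The comparison with '0' relies on free apps storing exactly that string in the price column. A slightly more defensive sketch is shown below. It assumes paid prices may carry a leading '$', which is an assumption about the file rather than something shown above, and `is_free`, `PRICE_INDEX` and the sample rows are hypothetical.

```python
PRICE_INDEX = 7  # 'Price' column, per the header row


def is_free(row):
    """Treat a row as free when its price parses to zero (assumes an optional leading '$')."""
    return float(row[PRICE_INDEX].lstrip('$')) == 0.0


# Hypothetical rows with the price at index 7
sample_rows = [
    ['App A', 'TOOLS', '4.1', '100', '19M', '10,000+', 'Free', '0', 'Everyone'],
    ['App B', 'GAME', '4.4', '250', '25M', '1,000+', 'Paid', '$4.99', 'Everyone'],
    ['App C', 'FAMILY', '3.9', '40', '14M', '500+', 'Free', '0', 'Teen'],
]

# A list comprehension does the same job as the append loop
free_rows = [row for row in sample_rows if is_free(row)]
free_names = [row[0] for row in free_rows]
print(free_names)
```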

With the data cleaned, the analysis can begin. The first function below builds a frequency table for any column index we want to look at, and the second prints that table sorted from the most to the least frequent value.

In [12]:
def freq_table(dataset, index):
    freq_dict = {}
    for values in dataset:
        if values[index] in freq_dict:
            freq_dict[values[index]] += 1
        else:
            freq_dict[values[index]] = 1
    return freq_dict

In [13]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        # put the count first so that sorting orders the tuples by frequency
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [15]:
display_table(android_free_final, 1)

FAMILY : 1443
GAME : 835
TOOLS : 656
FINANCE : 289
PRODUCTIVITY : 282
LIFESTYLE : 279
BUSINESS : 253
PHOTOGRAPHY : 248
SPORTS : 238
COMMUNICATION : 234
PERSONALIZATION : 233
HEALTH_AND_FITNESS : 232
MEDICAL : 228
SOCIAL : 201
NEWS_AND_MAGAZINES : 198
TRAVEL_AND_LOCAL : 179
SHOPPING : 178
BOOKS_AND_REFERENCE : 159
VIDEO_PLAYERS : 144
DATING : 131
EDUCATION : 113
MAPS_AND_NAVIGATION : 112
ENTERTAINMENT : 100
FOOD_AND_DRINK : 92
AUTO_AND_VEHICLES : 72
WEATHER : 65
LIBRARIES_AND_DEMO : 64
HOUSE_AND_HOME : 62
ART_AND_DESIGN : 57
COMICS : 53
PARENTING : 48
EVENTS : 45
BEAUTY : 42
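
Raw counts are easier to compare as shares of the total. The sketch below shows one possible way to get both, using `collections.Counter` on hypothetical rows. `freq_table_percent` and `sample_rows` are names made up for this illustration, not functions used above.

```python
from collections import Counter


def freq_table_percent(dataset, index):
    """Return (value, count, percentage) tuples, most frequent value first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [
        (value, count, round(100 * count / total, 2))
        for value, count in counts.most_common()
    ]


# Hypothetical rows in the form [name, category]
sample_rows = [
    ['App A', 'TOOLS'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'FAMILY'],
    ['App E', 'TOOLS'],
]

table = freq_table_percent(sample_rows, 1)
for value, count, percent in table:
    print(value, ':', count, '(' + str(percent) + '%)')
```

Calling it as `freq_table_percent(android_free_final, 1)` would give the category table above with a percentage next to each count.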
