We work as data analysts for a company that builds Android and iOS mobile apps.
We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. 

This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. 

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

To avoid spending resources on collecting new data ourselves, we should first try to see if we can find any relevant existing data at no cost.

The explore_data() function:

- Takes in four parameters:
  - dataset, which is expected to be a list of lists.
  - start and end, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
  - rows_and_columns, which is expected to be a Boolean and has False as a default argument.
- Slices the data set using dataset[start:end].
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using print('\n').
  - The \n in print('\n') is a special character and won't be printed. Instead, the \n character adds a new line, and we use print('\n') to add some blank space between rows.
- Prints the number of rows and columns if rows_and_columns is True.
  - dataset shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).
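
As a quick check of the behaviour described above, the sketch below runs a copy of explore_data() on a small made-up data set. The `sample` rows are hypothetical and are not taken from either CSV file; the function is repeated here only so that the snippet runs on its own.

```python
def explore_data(dataset, start, end, rows_and_columns=False):
    # Same logic as the notebook's explore_data(), repeated for a standalone run
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')  # blank space between rows

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


# Hypothetical sample: three rows, header already removed
sample = [
    ['App A', 'GAME', '4.5'],
    ['App B', 'TOOLS', '4.0'],
    ['App C', 'FAMILY', '3.9'],
]

# Print the first two rows, then the overall dimensions of the data set
explore_data(sample, 0, 2, rows_and_columns=True)
```

Note that the row and column counts describe the whole data set, not just the printed slice.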

In [None]:
# Import pandas and matplotlib
# Render plots inline as soon as they are created
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from csv import reader

plt.style.use('ggplot')  # Nicer-looking plots
plt.rcParams['figure.figsize'] = (15, 5)  # Figure size

In [None]:
# Formatted strings: prefix the string with f and use curly braces
# Calculus, linear algebra, mathematical statistics, probability theory, combinatorics

intensive = "Accelerated-format training"
duration = "130 minutes"
speaker = "Sasha"
profession = "junior"
whois = speaker + ", " + profession
print(f"Today we have {intensive}, led by {whois}, lasting {duration}, covering how taxes are calculated")

income = [12, 26, 43, 54]
for i in income:
    tax = 0.13 * i
    print(f"If you earned an income of {i}, you must pay tax of {tax}")
    
plt.plot(income)
plt.show()

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
### The Google Play data set ###
with open('googleplaystore.csv', encoding='utf-8') as opened_file:
    android = list(reader(opened_file))  # read every row, then close the file
android_header = android[0]
android = android[1:]

In [None]:
### The App Store data set ###
with open('AppleStore.csv', encoding='utf-8') as opened_file:
    ios = list(reader(opened_file))  # read every row, then close the file
ios_header = ios[0]
ios = ios[1:]

In [None]:
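# Print the index of every row whose length differs from the header's
for index, row in enumerate(android):
    if len(row) != len(android_header):
        print(index)

# Remove the malformed row reported above (run this cell only once,
# otherwise a valid row at the same index is deleted)
del android[10472]

for index, row in enumerate(ios):
    if len(row) != len(ios_header):
        print(index)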

In [None]:
print(android_header)
print(ios_header)


The Google Play data set has duplicate entries. Print a few duplicate rows to confirm.
Count the number of duplicates.
Use this information to build a criterion for removing the duplicates.
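
One way to carry out these steps is sketched below on a small made-up list of rows. The `rows` data and the helper name `find_duplicates` are hypothetical. The helper collects every row whose name has already been seen, so a few duplicates can be printed and counted.

```python
def find_duplicates(dataset, name_index=0):
    """Return (unique_names, duplicate_rows) for a list of rows."""
    seen = set()          # names encountered so far
    duplicate_rows = []   # full rows whose name was already seen
    for row in dataset:
        name = row[name_index]
        if name in seen:
            duplicate_rows.append(row)
        else:
            seen.add(name)
    return seen, duplicate_rows


# Hypothetical rows: name, category, number of reviews
rows = [
    ['Chat App', 'COMMUNICATION', '100'],
    ['Photo App', 'PHOTOGRAPHY', '250'],
    ['Chat App', 'COMMUNICATION', '120'],
    ['Chat App', 'COMMUNICATION', '110'],
]

unique_names, duplicate_rows = find_duplicates(rows)
print('Unique apps:', len(unique_names))
print('Duplicate rows:', len(duplicate_rows))
for row in duplicate_rows[:3]:  # print a few duplicates to inspect them
    print(row)
```

In this made-up sample the duplicate rows differ only in the review count, which is the kind of difference a removal criterion can be built on.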

In [None]:
unique_android = []
duplicate_android = []
for app in android:
    check_name = app[0]
    if check_name in unique_android:
        duplicate_android.append(check_name)
    else:
        unique_android.append(check_name)
        
print("Unique Google apps: ", len(unique_android))
print("Duplicate Google rows: ", len(duplicate_android))

unique_ios = []
duplicate_ios = []
for app in ios:
    check_name = app[0]
    if check_name in unique_ios:
        duplicate_ios.append(check_name)
    else:
        unique_ios.append(check_name)
        
print("Unique Apple apps: ", len(unique_ios))
print("Duplicate Apple rows: ", len(duplicate_ios))
        

Create a dictionary where each key is a unique app name and the corresponding value is the highest number of reviews of that app.
Use the dictionary you created above to remove the duplicate rows.
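
The two steps above can be illustrated on a tiny made-up data set. The `rows` list and the function name `dedupe_by_max_reviews` are hypothetical, and the review count is assumed to sit in column 2 of this sample. This is a sketch of the idea under those assumptions.

```python
def dedupe_by_max_reviews(dataset, name_index, reviews_index):
    """Keep one row per app name: the one with the most reviews."""
    # Step 1: highest review count seen for each name
    reviews_max = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Step 2: keep the first row that matches the maximum for its name
    clean = []
    already_added = set()
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if reviews_max[name] == n_reviews and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean


# Hypothetical rows: name, category, number of reviews
rows = [
    ['Chat App', 'COMMUNICATION', '100'],
    ['Photo App', 'PHOTOGRAPHY', '250'],
    ['Chat App', 'COMMUNICATION', '120'],
    ['Chat App', 'COMMUNICATION', '120'],
]

clean_rows = dedupe_by_max_reviews(rows, 0, 2)
for row in clean_rows:
    print(row)
```

The `already_added` check matters when two duplicate rows share the same maximum review count: without it, both would be kept.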

In [None]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print("Total entries in the Google dictionary: ", len(reviews_max))

reviews_max_ios = {}
for app in ios:
    name = app[0]
    n_reviews = float(app[5])
    if name in reviews_max_ios and reviews_max_ios[name] < n_reviews:
        reviews_max_ios[name] = n_reviews
    elif name not in reviews_max_ios:
        reviews_max_ios[name] = n_reviews

print("Total entries in the Apple dictionary: ", len(reviews_max_ios))

android_clean = []
android_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])   
    if (reviews_max[name] == n_reviews) and (name not in android_added):
        android_clean.append(app)
        android_added.append(name)

ios_clean = []
ios_added = []
for app in ios:
    name = app[0]
    n_reviews = float(app[5])   
    if (reviews_max_ios[name] == n_reviews) and (name not in ios_added):
        ios_clean.append(app)
        ios_added.append(name)
            
print("Total rows in the cleaned Google list: ", len(android_clean))
print("Total rows in the cleaned Apple list: ", len(ios_clean))
        

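The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII standard. Because some English app names contain emoji or symbols such as ™, the function below treats a name as non-English only if more than three of its characters fall outside this range.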
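
Python's built-in `ord()` returns the code point of a character, which makes this range easy to check. The sketch below uses a hypothetical helper, `count_non_ascii`, to count the characters in a name that fall outside 0 to 127.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    print(name, '->', count_non_ascii(name))
```

A name made up only of English letters, digits, and common punctuation gives a count of zero, while each symbol such as ™ or each emoji adds one.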

In [None]:
def eng_check(string):
    x=0
    for character in string:        
        if ord(character) > 127:
            x += 1
    if x > 3:
        return False        
    else:
        return True

print(eng_check('Docs To Go™ Free Office Suite'))
print(eng_check('Instachat 😜'))
print(eng_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))

    
            
            

Isolating the free apps is one of the last steps in the data cleaning process. Once the non-English apps are filtered out as well, we can start analyzing the data.
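
Values read with `csv.reader` are strings, and the spelling of a zero price can differ between files (for example `'0'` versus `'0.0'`, possibly with a currency symbol on paid apps). The sketch below shows one hedged way to test for a free app that does not depend on the exact spelling of zero. The helper name `is_free` and the sample values are assumptions for illustration.

```python
def is_free(price):
    """Return True if a price string represents zero."""
    cleaned = price.strip().lstrip('$')  # drop whitespace and a leading dollar sign
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (e.g. an empty string) are treated as not free
        return False


for price in ['0', '0.0', '$4.99', '1.99', '']:
    print(repr(price), '->', is_free(price))
```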

In [None]:
android_free = []
android_non_free = []
for app in android_clean:
    name = app[0]
    n_price = app[7]   
    if n_price == "0":
        android_free.append(app)
    else:
        android_non_free.append(app)

print("Free Google apps: ", len(android_free))
print("Paid Google apps: ", len(android_non_free))

ios_free = []
ios_non_free = []
for app in ios_clean:
    name = app[0]
    n_price = app[4]   
    if n_price == "0.0":
        ios_free.append(app)
    else:
        ios_non_free.append(app)

print("Free Apple apps: ", len(ios_free))
print("Paid Apple apps: ", len(ios_non_free))

In [None]:
android_eng_only = []
for app in android_free:
    name = app[0]
    if eng_check(name):
        android_eng_only.append(app) 

ios_eng_only = []
for app in ios_free:
    name = app[1]
    if eng_check(name):
        ios_eng_only.append(app)
        
print("Free English Google apps: ", len(android_eng_only))
print("Free English Apple apps: ", len(ios_eng_only))
     

Count the number of apps per genre

In [None]:
popular_app_android = {}

for app in android_eng_only:
    genre = app[1]
    if genre in popular_app_android:
        popular_app_android[genre] += 1
    else:
        popular_app_android[genre]  = 1

print(popular_app_android)

popular_app_ios = {}

for app in ios_eng_only:
    genre = app[-5]
    if genre in popular_app_ios:
        popular_app_ios[genre] += 1
    else:
        popular_app_ios[genre]  = 1

print(popular_app_ios)

    
        

Find the most popular apps.
To do this, sort by popularity.
Show the counts as percentages.
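
The steps above can be sketched with `collections.Counter` on a small made-up list of rows. The `rows` data and the function name `sorted_percentages` are hypothetical. The function counts the values in one column, converts the counts to percentages, and sorts them from most to least frequent.

```python
from collections import Counter


def sorted_percentages(dataset, index):
    """Return (value, percentage) pairs sorted by descending percentage."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    percentages = [(value, count / total * 100) for value, count in counts.items()]
    # Sort on the percentage, highest first
    return sorted(percentages, key=lambda pair: pair[1], reverse=True)


# Hypothetical rows: name, genre
rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'GAME'],
]

for value, percentage in sorted_percentages(rows, 1):
    print(value, ':', round(percentage, 2), '%')
```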

In [None]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages




In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', round(entry[0], 2), "%")
        
# display_table() prints its results and returns None, so call it directly
display_table(ios_eng_only, -5)
print()
display_table(android_eng_only, 1)
    

In [None]:
# ML is built to improve something.
# ML tasks: making predictions, analyzing and generating images and video
# Workflow stages: data collection, transformation, training, prediction (evaluating model quality)
# We train on examples, steering the model toward the goal set by the problem statement
# Basic ML tools: scikit-learn, pandas, Matplotlib

# DataFrame
# android is a plain list of rows, so wrap it in a DataFrame before using pandas methods
android_df = pd.DataFrame(android, columns=android_header)

print(android_header)
android_df.Category.describe()

In [None]:
# Note: fixed_df is not defined in this notebook; it is assumed to be a pandas
# DataFrame holding the App Store data.
fixed_df.price.describe()  # A column and its summary statistics

fixed_df.user_rating.value_counts()  # Another column: how often each value occurs
fixed_df.user_rating.value_counts().plot(kind="bar")  # Chain .plot() to show the counts as a bar chart
# Because these expressions share one cell, only the last one's output is displayed

In [None]:
fixed_df.user_rating.hist()  # Draw a histogram

In [None]:
fixed_df.prime_genre.value_counts().plot(kind='bar')  # Exploratory analysis: get a sense of what the data set contains

In [None]:
pd.get_dummies(fixed_df, columns=['track_name', 'currency', 'prime_genre'])

In [None]:
# Build the model
# Stage 1: data preparation
# Replace string values with 0/1 indicator columns

fixed_df_transformed = pd.get_dummies(fixed_df, columns=['track_name', 'currency', 'prime_genre'])
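
To make the idea of replacing string values with 0/1 columns concrete, here is a plain-Python sketch of one-hot encoding on made-up records. The helper name `one_hot` and the sample data are hypothetical; `pd.get_dummies` performs a similar transformation on a DataFrame, although its exact output types may differ by pandas version.

```python
def one_hot(records, column):
    """Replace one string column with a 0/1 indicator column per distinct value."""
    categories = sorted({record[column] for record in records})
    encoded = []
    for record in records:
        # Copy every field except the one being encoded
        new_record = {key: value for key, value in record.items() if key != column}
        for category in categories:
            new_record[f'{column}_{category}'] = 1 if record[column] == category else 0
        encoded.append(new_record)
    return encoded


# Hypothetical records with one numeric and one string field
records = [
    {'price': 0.0, 'prime_genre': 'Games'},
    {'price': 1.99, 'prime_genre': 'Music'},
    {'price': 0.0, 'prime_genre': 'Games'},
]

for row in one_hot(records, 'prime_genre'):
    print(row)
```

Encoding a column with many distinct values, such as an app name, produces one new column per value, so the resulting table can become very wide.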

In [None]:
# The data we want to learn to make predictions from
input_data = fixed_df_transformed.drop('user_rating_ver', axis=1)
user_rating_ver = fixed_df_transformed.user_rating_ver  # What we are trying to predict

In [None]:
model = RandomForestClassifier()

model.fit(input_data, user_rating_ver)

In [None]:
{col:0 for col in input_data.columns}