# Alcohol Consumption in Russia

![Alcoholic Beverages in Russia](images/drinks.png)
            Source: [The Russian alcohol market: a heady cocktail](http://www.food-exhibitions.com/Market-Insights/Russia/The-Russian-alcohol-market)

## Project Motivation

A fictitious company owns a chain of stores across Russia that sell many types of alcohol. The company recently ran a very successful wine promotion in Saint Petersburg. Because of the cost to the business, it isn’t possible to run the promotion in every region. The marketing team would like to target 10 other regions whose buying habits are similar to Saint Petersburg's, where the promotion can be expected to be similarly successful, and needs help determining which regions to select.

![Regions in Russia](images/regions.png)
        Source: [Outline of Russia](https://en.wikipedia.org/wiki/Outline_of_Russia)
        
This project aims to use a machine learning algorithm to recommend at least 10 regions with alcohol buying habits similar to those of Saint Petersburg.

## The Dataset

The data used in this project was obtained from [DataCamp's Career Hub repository](https://github.com/datacamp/careerhub-data) on GitHub. It contains 7 variables, as seen in the description below:

![Description of dataset](images/data_description.png)

## Analysis Plan

Based on the project brief, the problem is best solved with an unsupervised machine learning algorithm that clusters regions by their sales patterns, so that the regions grouped with Saint Petersburg can be identified. The algorithm itself is selected in a later section.

The following steps will be followed:

- Perform Exploratory Data Analysis to identify patterns and draw insights from the data.
- Select a suitable unsupervised machine learning algorithm based on the problem to solve and the findings of the exploratory data analysis.
- Discuss model performance.
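
To make the plan above more concrete, here is a minimal sketch of what the clustering step could look like. It is an illustration only: the figures are made up (not taken from the real dataset), the region names `Region A` to `Region D` and the column names are hypothetical, and KMeans with `n_clusters=2` is a stand-in, since the actual algorithm is chosen later in the project.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy, made-up sales figures (NOT from the real dataset); column names are assumptions.
toy = pd.DataFrame(
    {
        "region": ["Saint Petersburg", "Region A", "Region B", "Region C", "Region D"],
        "wine": [8.0, 7.8, 2.1, 7.9, 2.3],
        "beer": [60.0, 58.0, 90.0, 61.0, 88.0],
        "vodka": [10.0, 10.5, 18.0, 9.8, 17.5],
    }
).set_index("region")

# Standardize the features so that no single beverage dominates the distances.
scaled = StandardScaler().fit_transform(toy)

# KMeans is only a placeholder algorithm; k=2 is arbitrary for this toy example.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
clusters = pd.Series(labels, index=toy.index, name="cluster")

# Regions that share Saint Petersburg's cluster are the candidate regions.
target_cluster = clusters["Saint Petersburg"]
similar_regions = sorted(
    clusters[clusters == target_cluster].index.drop("Saint Petersburg")
)
print(similar_regions)
```

On the real data, the same pattern would apply after aggregating the yearly records to one row per region, and the number of clusters would need to be chosen so that Saint Petersburg's cluster contains at least 10 other regions.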

### Exploratory Data Analysis

This section will explore the data to discover trends and insights. It will be done by creating plots of features against their values. The following steps will be implemented:

- Read the data.
- Check for data quality issues.
- Visualize the data to observe patterns and trends.
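
As a rough illustration of the first two steps, the sketch below reads a small CSV, drops duplicate rows, and summarizes missing values with built-in pandas methods. The CSV content and its column names are invented for the example and only stand in for the real file; plotting is left out to keep the sketch short.

```python
import io

import pandas as pd

# Tiny made-up CSV standing in for the real data file (values are illustrative only).
csv_text = """year,region,wine,beer
1998,Region A,1.9,8.8
1998,Region A,1.9,8.8
1999,Region A,,9.1
1999,Region B,2.5,
"""

# Step 1: read the data.
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: check for data quality issues, starting with duplicate rows.
df = df.drop_duplicates()

# Summarize data types and missing values per column.
null_summary = pd.DataFrame(
    {
        "DataType": df.dtypes,
        "CountOfNulls": df.isna().sum(),
        "PercentOfNulls": 100 * df.isna().mean(),
    }
)
print(null_summary)
```

`df.isna().mean()` gives the fraction of missing values per column directly, which avoids looping over the columns by hand.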

In [3]:
# import system and exploratory analysis modules
import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy as np; print("Numpy", np.__version__)
import matplotlib
import matplotlib.pyplot as plt; print("Matplotlib", matplotlib.__version__)
import pandas as pd; print("Pandas", pd.__version__)
import seaborn as sns; print("Seaborn", sns.__version__)
import scipy; print("Scipy", scipy.__version__)
import sklearn; print("Scikit -Learn", sklearn.__version__)

Windows-10-10.0.19041-SP0
Python 3.6.12 |Anaconda, Inc.| (default, Sep  9 2020, 00:29:25) [MSC v.1916 64 bit (AMD64)]
Numpy 1.19.2
Matplotlib 3.3.2
Pandas 1.1.5
Seaborn 0.11.1
Scipy 1.5.2
Scikit -Learn 0.23.2


In [None]:
# function to read data, check for nulls and drop duplicates
def read_data(data_path):
    # read data
    print("Reading Alcohol Consumption in Russia dataset\n")
    df = pd.read_csv(data_path)
    # make a copy of dataframe
    print("Making a copy of the dataframe\n")
    df_1 = df.copy()
    # drop duplicates
    df_final = df_1.drop_duplicates()
    # extract feature names
    df_cols = df_final.columns.tolist()
    # empty lists to hold data types, non-null counts, null counts, percentage of nulls in each column,
    # and each column's percentage of all nulls in the dataframe
    data_types = []
    non_nulls = []
    nulls = []
    null_column_percent = []
    null_df_percent = []
    
    # loop through columns and capture the variables above
    print("Extracting count and percentages of nulls and non nulls")
    for col in df_cols:
        
        # extract null count
        null_count = df_final[col].isna().sum()
        nulls.append(null_count)
        
        # extract non null count
        non_null_count = len(df_final) - null_count
        non_nulls.append(non_null_count)
        
        # extract % of null in column
        col_null_perc = 100 * null_count/len(df_final)
        null_column_percent.append(col_null_perc)
        
        # extract % of nulls out of total nulls in dataframe (0 when the dataframe has no nulls)
        total_nulls = df_final.isna().sum().sum()
        df_null_perc = 100 * null_count / total_nulls if total_nulls else 0.0
        null_df_percent.append(df_null_perc)
        
        # capture data types
        data_types.append(df_final[col].dtypes) 
    # create zipped list with column names, data_types, nulls and non nulls
    lst_data = list(zip(df_cols, data_types, non_nulls, nulls, null_column_percent, null_df_percent))
    # create dataframe of zipped list
    df_zipped = pd.DataFrame(lst_data, columns = ['Feature', 'DataType', 'CountOfNonNulls', 'CountOfNulls',\
                                                 'PercentOfNullsIinColumn', 'PercentOfNullsInData'])
    return df_final, df_