# Tobacco Consumption

Tobacco consumption is one of the primary causes of lung cancer worldwide. Tobacco in the form of cigars and cigarettes is usually available to the adult population in many supermarkets and grocery stores. The data used in this analysis describes tobacco consumption in the USA from 2000 to 2020. Based on the behavior of the data over those 21 years, the aim of the project is to predict total tobacco consumption in 2021 and 2022.

First, the libraries used in this project are imported.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import numpy as np
import seaborn as sns
import random
import math
from statsmodels.tsa.seasonal import seasonal_decompose

An additional import is included to suppress *FutureWarning* messages while processing the data.

In [None]:
import warnings
warnings.simplefilter(action="ignore", category=FutureWarning)

## Extraction

The data for this project is stored in a *.csv* file. The path to the file is defined in the variable *DATA_PATH*.

In [None]:
DATA_PATH = "../data/Tobacco_Consumption.csv"

The file is read and a sample of the data is shown.

In [None]:
tobacco_data_raw = pd.read_csv(DATA_PATH)
tobacco_data_raw.sample(10)

## Exploratory Data Analysis

The structure of the data table is described.

In [None]:
tobacco_data_raw.info()

In this table, there are categorical and numerical variables.

The exploration will initially focus on categorical variables and later on the numerical ones. 

### Categorical Data Exploration

The categorical data columns are filtered from the original dataframe.

In [None]:
# Filter categorical variables from data
# 'number' covers every numeric dtype regardless of platform integer width
tobacco_categorical_data = tobacco_data_raw.select_dtypes(exclude='number')
# Show head of tables
tobacco_categorical_data.head(10)

Categorical data columns are identified.

In [None]:
# Show numbers of columns
print(f"There is a total of {len(tobacco_categorical_data.columns)} categorical data columns")
# Show name of the columns
print(f"The columns are: {tobacco_categorical_data.columns}")

To explore the frequency of the elements in each column, a bar chart is plotted per column, where the x axis shows the elements in the column and the y axis shows the number of times each element appears.

In [None]:
# Create plot object
fig, axes = plt.subplots(2, 3, figsize=(20, 15))
fig.subplots_adjust(hspace=.5)
# Add a subplot with the frequency of elements for each categorical column,
# filling the grid column by column
for i, col in enumerate(tobacco_categorical_data.columns):
    # Pass the column as x by keyword; newer seaborn versions treat the
    # first positional argument as data
    sns.countplot(x=tobacco_categorical_data[col], ax=axes[i % 2, i // 2])
# Rotate the x tick labels of each subplot
for subplot_ax in fig.axes:
    plt.sca(subplot_ax)
    plt.xticks(rotation=45)

For *LocationDesc* and *LocationAbbrev* columns there is only one unique value each. Therefore, these columns are constants.
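The constant columns can also be confirmed without a plot by counting the distinct values per column. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset); on the real data, *tobacco_categorical_data* would take the place of the hypothetical *sample* frame.

```python
import pandas as pd

# Illustrative stand-in for the categorical columns (values are made up)
sample = pd.DataFrame({
    "LocationAbbrev": ["US", "US", "US", "US"],
    "LocationDesc": ["National", "National", "National", "National"],
    "Measure": ["Cigarettes", "Cigarettes", "Cigars", "Cigars"],
})

# Count distinct values per column
unique_counts = sample.nunique()
# Columns with a single distinct value are constant
constant_columns = unique_counts[unique_counts == 1].index.tolist()
print(constant_columns)
```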

Most values in *Submeasure* appear 21 times in the table.

The combinations of values in the columns *Measure*, *Submeasure* and *Units* are explored further, to identify how many times each distinct combination appears in the table.

#### Categorical data combinations

Unique combinations of categories are obtained.

In [None]:
# Get unique combinations by dropping duplicated rows of the categorical columns
tobacco_categorical_data.drop_duplicates()

The total and unique combinations are counted and compared.

In [None]:
# Get number of unique combinations and total combinations in the table
total_categories_combinations = len(tobacco_categorical_data)
unique_categories_combinations = len(tobacco_categorical_data.drop_duplicates())
# Print summary
print(f"Total combinations of categories (rows): {total_categories_combinations}")
print(f"Found {unique_categories_combinations} unique category combinations")
print(f"Ratio: {total_categories_combinations/unique_categories_combinations}")

13 combinations are repeated 21 times in the table.

This number matches the number of years in the data: the dataset includes 13 different values per year.
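The ratio above is an average, so it does not by itself show that every combination appears exactly once per year. A per-combination count makes that explicit. The sketch below uses a small made-up frame; the category values and the *sample* name are illustrative assumptions. On the real data the grouping would be run on *tobacco_data_raw*, since it needs the *Year* column as well.

```python
import pandas as pd

# Illustrative stand-in: 2 category combinations observed over 3 years
sample = pd.DataFrame({
    "Year": [2000, 2001, 2002, 2000, 2001, 2002],
    "Measure": ["Cigarettes"] * 3 + ["Cigars"] * 3,
    "Submeasure": ["Cigarette Removals"] * 3 + ["Total Cigars"] * 3,
})

category_columns = ["Measure", "Submeasure"]
# Number of rows for each distinct combination of categories
rows_per_combination = sample.groupby(category_columns).size()
print(rows_per_combination)

# True when every combination has as many rows as there are distinct years
one_row_per_year = bool((rows_per_combination == sample["Year"].nunique()).all())
print(one_row_per_year)
```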

### Numerical Data Exploration

The numerical data columns are filtered from the original dataframe.

In [None]:
# Filter numerical variables from data
# 'number' covers every numeric dtype regardless of platform integer width
tobacco_numerical_data = tobacco_data_raw.select_dtypes(include='number')
# Show head of tables
tobacco_numerical_data.head(10)

Numerical data columns are identified.

In [None]:
# Show numbers of columns
print(f"There is a total of {len(tobacco_numerical_data.columns)} numerical data columns")
# Show name of the columns
print(f"The columns are: {tobacco_numerical_data.columns}")

To understand how the variables relate to each other, pairwise correlations are computed and plotted.

In [None]:
# Explore correlations
correlations = tobacco_numerical_data.corr()
# Plot correlations
sns.heatmap(correlations, annot=True)
plt.show()

*Year* and *Population* have a strong correlation with each other, but a low correlation with the tobacco values.

The per capita values have a strong correlation with the corresponding absolute values.

A test is applied to verify if per capita values are obtained from total values and population.

In [None]:
# Difference between the total divided by population (rounded to 1 decimal) and the stored per capita column
relation_per_capita = round(tobacco_numerical_data["Total"]/tobacco_numerical_data["Population"], 1) - tobacco_numerical_data["Total Per Capita"]
round(relation_per_capita.median(), 3)

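The median difference is close to 0. The test was run on the *Total* column only; assuming the same relation holds for *Domestic* and *Imports*, the following expressions can be established from the data: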

$$
Domestic\_per\_capita= \frac{Domestic}{Population}
$$
$$
Imports\_per\_capita= \frac{Imports}{Population}
$$
$$
Total\_per\_capita= \frac{Total}{Population}
$$
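A median close to 0 can hide individual rows that deviate, so a stricter check looks at the largest absolute gap for each of the three columns. The sketch below uses made-up numbers in a hypothetical *sample* frame. The column names *Domestic Per Capita* and *Imports Per Capita* are assumed by analogy with *Total Per Capita*, and the rounding to one decimal mirrors the test above.

```python
import pandas as pd

# Illustrative stand-in (values are made up); the "Domestic Per Capita" and
# "Imports Per Capita" names are assumed by analogy with "Total Per Capita"
sample = pd.DataFrame({
    "Population": [100.0, 200.0, 400.0],
    "Domestic": [50.0, 60.0, 120.0],
    "Imports": [10.0, 20.0, 40.0],
    "Total": [60.0, 80.0, 160.0],
    "Domestic Per Capita": [0.5, 0.3, 0.3],
    "Imports Per Capita": [0.1, 0.1, 0.1],
    "Total Per Capita": [0.6, 0.4, 0.4],
})

# Largest absolute gap between the recomputed and the stored per capita value
max_gap = {}
for col in ["Domestic", "Imports", "Total"]:
    recomputed = (sample[col] / sample["Population"]).round(1)
    max_gap[col] = (recomputed - sample[f"{col} Per Capita"]).abs().max()
print(max_gap)
```

A gap near 0 for every column would support all three expressions; a large gap would point to rows worth inspecting.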

For further analysis, per capita columns are excluded.

*Domestic* and *Imports* have a strong correlation with the *Total* column. To check whether *Total* is their sum, the difference is computed.

In [None]:
# Difference between total and imports + domestic is obtained
difference_total = tobacco_numerical_data["Total"] - tobacco_numerical_data["Domestic"] - tobacco_numerical_data["Imports"]
difference_total.median()

The median difference is 0, which suggests that
$$
Total = Imports + Domestic
$$
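Since the median only describes the typical row, the identity can also be checked row by row. This is a minimal sketch on made-up values in a hypothetical *sample* frame, using a floating-point tolerance rather than exact equality; on the real data, *tobacco_numerical_data* would be used instead.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in (values are made up)
sample = pd.DataFrame({
    "Domestic": [50.0, 60.0, 120.0],
    "Imports": [10.0, 20.0, 40.0],
    "Total": [60.0, 80.0, 160.0],
})

expected_total = sample["Domestic"] + sample["Imports"]
# True only if the identity holds in every row (within floating-point tolerance)
identity_holds = np.allclose(sample["Total"], expected_total)
# Rows that violate the identity, if any
violations = sample[~np.isclose(sample["Total"], expected_total)]
print(identity_holds, len(violations))
```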

## Data Wrangling

## Modeling

## Conclusions