# Data exploration

## Python

Python is a high-level, general-purpose programming language. It is versatile and easy to learn, and its extensive ecosystem gives access to a wide range of libraries.

## Modules, Packages and Libraries
A module is a file that groups related functions, classes, and variables. A package, in turn, is a collection of modules, and a library is a collection of packages. Libraries make code reusable by offering pre-built features for common problems, so the same code does not have to be rewritten each time.
To install a new library, you can usually type the following in the terminal/command line: pip install [package_name].

In this workshop, we will make use of the following libraries:

* numpy
* pandas
* scikit-learn
* tensorflow
* matplotlib
* seaborn
* plotly

Below are two ways of importing a library.

In [None]:
import numpy as np
from numpy import array

In the first case, the library is imported in its entirety and accessed through the alias np. In the second case, only a specific function (array) is imported.
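As a small sketch of the difference between the two import styles: both give access to the same function, and only the name used to call it changes. The variable names a and b are chosen here for illustration.

```python
import numpy as np          # whole library, accessed through the alias np
from numpy import array     # only the array function, accessed directly

a = np.array([1, 2, 3])     # call through the library alias
b = array([1, 2, 3])        # call the imported name directly

# Both names refer to the same function, so the results are identical.
print(np.array is array)
print((a == b).all())
```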

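Import all the remaining libraries in their entirety. The cells below assume the conventional aliases pd (pandas), plt (matplotlib.pyplot), and sns (seaborn).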

## Dataset

A dataset is a collection of data. It is not limited to tabular form and can also hold images, text, and more. A tabular dataset is composed of objects (rows) and features (columns), where the features are the characteristics/variables that describe each object. Variables store and represent values. In Python, there is no need to declare a variable's type in advance.

Types of variables:
- string;
- numeric (integer or float);
- boolean;
- lists;
- tuples;
- dictionaries.
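The types listed above can be illustrated with a short sketch. All names and values below are hypothetical and chosen only for illustration; the last lines show that a name can be rebound to a value of a different type.

```python
name = "Ada"                    # string
count = 3                       # integer
price = 9.99                    # float
is_open = True                  # boolean
values = [1, 2, 3]              # list (can be modified after creation)
point = (4.0, 5.0)              # tuple (cannot be modified after creation)
ages = {"Ada": 36}              # dictionary (key-value pairs)

# Print the type name of each value next to the value itself.
for value in (name, count, price, is_open, values, point, ages):
    print(type(value).__name__, value)

# The type belongs to the value, not to the name: the same name
# can later hold a value of another type.
count = "three"
print(type(count).__name__)
```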

**Boolean**

Boolean objects are either True or False (equivalent to 1 and 0, respectively) and allow conditions to be evaluated. The **if**, **elif**, and **else** statements evaluate conditions and run code based on the result. The **and** and **or** operators combine several conditions into one.

The code below illustrates an if condition.

In [None]:
k = 1
if k == 1:
    print(k)
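A slightly fuller sketch, combining if, elif, and else with the and, or, and not operators. The variables temperature, raining, and advice are hypothetical and exist only for this example.

```python
temperature = 25   # hypothetical values chosen for illustration
raining = False

if temperature > 30 and not raining:
    # Runs only when both conditions hold.
    advice = "stay in the shade"
elif temperature > 15 or raining:
    # Checked only if the first condition was False;
    # runs when at least one of the two conditions holds.
    advice = "a light jacket is enough"
else:
    # Runs when none of the conditions above was True.
    advice = "wear a coat"

print(advice)
```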

**Lists and dictionaries**

Lists store an ordered sequence of objects and can be created using [] or list(). Dictionaries can be created using {} or dict(). Unlike lists, dictionaries are not accessed by position: they store key-value pairs, and each value is looked up by its key (since Python 3.7 they also preserve insertion order).

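Examples of operations on lists:
* append(): a method that adds an item to the end of the list
* del: a statement that removes an item by index, as in del my_list[0]
* len(): a built-in function that returns the number of items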


In [None]:
list_numbers = [0,1,2,3,4]
list_numbers.append("A")
print(list_numbers)

Remove the value 0.

It is possible to add the previous list to a dictionary.

In [None]:
dict_month = {"jan": 31, "feb": [28, 29], "mar": 31}
dict_month["new_key"] = list_numbers
print(dict_month)
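The key-value logic can be sketched a little further. The dictionary stock_prices and its tickers are hypothetical and separate from the example above.

```python
# Hypothetical dictionary mapping ticker symbols to prices.
stock_prices = {"AAA": 10.5, "BBB": 20.0}

stock_prices["CCC"] = 7.25           # add a new key-value pair
print(stock_prices["AAA"])           # look up a value by its key
print(stock_prices.get("ZZZ"))       # get() returns None for a missing key
print(len(stock_prices))             # number of key-value pairs

# items() yields each key together with its value.
for ticker, price in stock_prices.items():
    print(ticker, price)
```

Accessing a missing key with square brackets raises a KeyError, which is why get() is often preferred when a key may be absent.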

**Iterators**

Loops make it possible to traverse objects and collections of objects, evaluate their values, and perform operations on them. Python provides two loop statements:
* for
* while

The code below illustrates iterating over a list.

In [None]:
for number in list_numbers:
    print(number)
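For comparison, here is a minimal sketch of a while loop, which repeats as long as its condition stays True. The names countdown and visited are hypothetical.

```python
countdown = 3
visited = []

# The loop body runs until the condition becomes False.
while countdown > 0:
    visited.append(countdown)
    countdown -= 1   # without this update the loop would never end

print(visited)
```

A for loop is usually the better fit when traversing a known collection, while a while loop suits cases where the number of repetitions is not known in advance.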

## pandas

According to its creators, "pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool".


Types of pandas data structures:
- Series;
- DataFrame.
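As a rough sketch of the two structures: a Series is a single labelled column of values, and a DataFrame is a table whose columns are Series. The data below is made up for illustration.

```python
import pandas as pd

# A Series: one-dimensional, labelled values.
closing = pd.Series([10.0, 10.5, 9.8], name="close")

# A DataFrame: a two-dimensional table built here from a dictionary
# that maps column names to lists of values.
table = pd.DataFrame({
    "Name": ["AAA", "AAA", "BBB"],
    "close": [10.0, 10.5, 9.8],
})

# Selecting one column of a DataFrame returns a Series.
print(type(table["close"]).__name__)
print(table.shape)   # (number of rows, number of columns)
```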

The function read_csv() is used to read the file containing the data, thus generating a DataFrame.

In [None]:
opened_data = pd.read_csv("data/all_stocks_5yr.csv")

The head() method displays the first rows of the table (five by default).

In [None]:
opened_data.head()

In turn, the info() function allows obtaining global information about the DataFrame.

In [None]:
opened_data.info()

The describe() method returns summary statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns of the table.

In [None]:
opened_data.describe()

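At times, some rows lack a value in one or more columns (missing values). The dropna() method handles this by removing every row that contains a missing value; with inplace=True, the DataFrame is modified directly instead of a copy being returned.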

In [None]:
opened_data.dropna(inplace=True)
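Before dropping rows, it can help to check how many values are actually missing. The sketch below uses a small made-up DataFrame (toy_missing) rather than the workshop data.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value in each column.
toy_missing = pd.DataFrame({
    "open": [1.0, np.nan, 3.0],
    "close": [1.5, 2.5, np.nan],
})

# Count the missing values per column.
missing_per_column = toy_missing.isna().sum()
print(missing_per_column)

# Without inplace=True, dropna() returns a new DataFrame
# and leaves the original untouched.
cleaned = toy_missing.dropna()
print(cleaned)
```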

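The groupby() method supports category-based analysis: it groups rows that share a value in a column, here the company name, and applies an aggregation such as count() to each group.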

In [None]:
count_by_company = opened_data.groupby("Name").count()
print(count_by_company)
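Other aggregations follow the same pattern. As a sketch on made-up data (toy_stocks is hypothetical), size() counts the rows per group and mean() averages a chosen column within each group.

```python
import pandas as pd

toy_stocks = pd.DataFrame({
    "Name": ["AAA", "AAA", "BBB"],
    "close": [10.0, 12.0, 20.0],
})

# Number of rows for each company.
rows_per_company = toy_stocks.groupby("Name").size()

# Mean closing value for each company.
mean_close = toy_stocks.groupby("Name")["close"].mean()

print(rows_per_company)
print(mean_close)
```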

The .loc attribute can be used to locate rows based on the label. It is also possible to combine the .loc attribute with boolean logic, thus making sub-partitions of the data according to conditions.

In [None]:
opened_data.loc[opened_data["Name"] == "AAPL"]

The .iloc attribute selects rows and columns by integer position.

In [None]:
opened_data.iloc[0:15,1]
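The difference between label-based and position-based selection is easiest to see on a table whose row labels are not numbers. The sketch below uses a made-up DataFrame (toy_prices) with hypothetical labels r1 to r4.

```python
import pandas as pd

toy_prices = pd.DataFrame(
    {"Name": ["AAA", "BBB", "AAA", "CCC"], "close": [10.0, 20.0, 12.0, 5.0]},
    index=["r1", "r2", "r3", "r4"],
)

by_label = toy_prices.loc["r2", "close"]    # row label, column label
by_position = toy_prices.iloc[1, 1]         # row position, column position

# Conditions are combined with & (and) or | (or); each condition
# needs its own parentheses.
filtered = toy_prices.loc[(toy_prices["Name"] == "AAA") & (toy_prices["close"] > 10)]

# Position slices exclude the end, label slices include it.
first_two = toy_prices.iloc[0:2]
label_slice = toy_prices.loc["r1":"r2"]

print(by_label, by_position)
print(filtered)
```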

#**Exercise: How many different years appear in the dataset?**

#**Exercise: What is the percentage of rows from 2018?**

#**Exercise: What is the percentage of rows for each year?**

# Data visualization

## seaborn

Seaborn is a Python library based on matplotlib for creating statistical graphics. Customizing a plot beyond the defaults requires consulting the library's documentation.
Functions for the production of graphics:
* barplot()
* boxplot()
* violinplot()
* lineplot()
* heatmap()
* histplot()

Now we will work with the file processed_stock_data.csv, which contains only the yearly average closing values of the stocks.

In [None]:
reformed_df = pd.read_csv("data/processed_stock_data.csv")
reformed_df.head(5)

You can calculate the mean, standard deviation, and variance along an axis. With axis=0 the calculation runs down the rows and returns one value per column, while with axis=1 it runs across the columns and returns one value per row.

In [None]:
yearly_average_df = reformed_df.mean(axis = 0)
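The effect of the axis argument can be checked on a tiny table. The DataFrame toy_close and its values are made up; the column names only mimic the style of the workshop file.

```python
import pandas as pd

toy_close = pd.DataFrame(
    {"2017_close": [10.0, 20.0], "2018_close": [30.0, 40.0]},
    index=["AAA", "BBB"],
)

# axis=0: one mean per column (averaging over the rows).
per_column = toy_close.mean(axis=0)

# axis=1: one mean per row (averaging over the columns).
per_row = toy_close.mean(axis=1)

print(per_column)
print(per_row)
```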

The code below allows you to create a bar chart.

In [None]:
sns.barplot(yearly_average_df, legend=False)

It is possible to customize the chart according to individual preferences and the representation's objective by adding and modifying arguments.

In [None]:
sns.barplot(yearly_average_df, legend=False)
plt.title("Yearly stock value")
plt.xticks(rotation=45)

Boxplots enable the graphical visualization of outliers.

In [None]:
sns.boxplot(reformed_df)

#**Exercise: Build a violinplot to further assess the distribution**

#**Exercise: Identify the top 5 stocks with lowest closing in 2018 and build a lineplot with the years as the x-axis**

# Data preprocessing

## Outlier detection and handling

An outlier is a data point that deviates significantly from the rest of the observations in the dataset. If neglected, outliers can distort the intended analysis. They can be detected using quantiles, for example by discarding values below the 1st percentile or above the 99th percentile.


In [None]:
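# 1st and 99th percentiles of the 2018 closing values
quantile_low = reformed_df["2018_close"].quantile(0.01)
quantile_high = reformed_df["2018_close"].quantile(0.99)

# Keep only the rows whose 2018 closing value lies strictly between the two thresholds
reformed_df_filtered = reformed_df[(reformed_df["2018_close"] < quantile_high) & (reformed_df["2018_close"] > quantile_low)]

# Compare the distributions before (left) and after (right) filtering
fig, axes = plt.subplots(1, 2)
sns.violinplot(reformed_df, ax=axes[0])
sns.violinplot(reformed_df_filtered, ax=axes[1])
plt.show()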
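A common alternative based on quartiles is the interquartile range (IQR) rule, which is not the method used in the cell above. The sketch below assumes the conventional factor of 1.5 and uses a made-up Series (toy_values) in which one value is clearly extreme.

```python
import pandas as pd

toy_values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 100.0])

q1 = toy_values.quantile(0.25)   # first quartile
q3 = toy_values.quantile(0.75)   # third quartile
iqr = q3 - q1                    # interquartile range

# Conventional thresholds: 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = toy_values[(toy_values < lower) | (toy_values > upper)]
kept = toy_values[(toy_values >= lower) & (toy_values <= upper)]

print(outliers)
print(kept)
```

Unlike fixed percentile cut-offs, this rule removes nothing when the data contains no extreme values.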

## scikit-learn

scikit-learn provides a large number of functions for machine learning, as well as for data preprocessing. These functions make it possible to analyze and prepare data for machine learning.

## Features, target variable

Before executing an ML protocol, it is necessary to split the data into features and target variable. The target variable is what we aim to predict, serving as the dependent variable. Features, on the other hand, are the independent variables used in predicting the target variable.

In [None]:
features = reformed_df.drop(columns = ["2018_close"])
target_variable = reformed_df["2018_close"]

## Data splitting

Next, it is essential to split the dataset into training and testing sets, with the target variable isolated since it is the variable we want to predict. To do this, we will use the train_test_split function from scikit-learn.


In [None]:
from sklearn.model_selection import train_test_split
train_features, test_features, train_target, test_target = train_test_split(features, target_variable, random_state = 42)
print(train_features.shape, test_features.shape, train_target.shape, test_target.shape)
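When test_size is not given, train_test_split holds out 25% of the rows for testing. The sketch below makes that explicit on made-up arrays (toy_features and toy_target are hypothetical) and checks that features and target stay aligned after shuffling.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 hypothetical samples with 2 features each; row i is [2*i, 2*i + 1].
toy_features = np.arange(40).reshape(20, 2)
toy_target = np.arange(20)

X_train, X_test, y_train, y_test = train_test_split(
    toy_features,
    toy_target,
    test_size=0.25,     # fraction of rows held out for testing
    random_state=42,    # fixed seed so the split is reproducible
)

print(X_train.shape, X_test.shape)

# Rows are shuffled, but each feature row keeps its own target value.
print((X_train[:, 0] == 2 * y_train).all())
```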

## Data Standardization/normalization

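Data standardization prepares datasets for analysis by bringing features to a consistent scale. StandardScaler does this by subtracting each feature's mean and dividing by its standard deviation, so that features measured on different scales contribute comparably to the analysis. The scaler is fitted on the training set only and then applied to both sets, so that no information from the test set leaks into training.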


In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(train_features)

In [None]:
scaled_train_features = scaler.transform(train_features)
scaled_test_features = scaler.transform(test_features)
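The effect of the scaler can be verified on a small made-up example (toy_train, toy_test, and toy_scaler are hypothetical names). After the transformation, each training column should have a mean of about 0 and a standard deviation of about 1; the test set is scaled with the training statistics, so its values need not follow that pattern.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales.
toy_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
toy_test = np.array([[2.0, 400.0]])

# Learn the mean and standard deviation from the training data only.
toy_scaler = StandardScaler().fit(toy_train)

toy_scaled_train = toy_scaler.transform(toy_train)
toy_scaled_test = toy_scaler.transform(toy_test)

print(toy_scaled_train.mean(axis=0))   # approximately 0 for each column
print(toy_scaled_train.std(axis=0))    # approximately 1 for each column
print(toy_scaled_test)
```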