# Stock Price Prediction for the 5 Biggest Brazilian Commodity-Based Companies  

<figure>
<img src="./images/image_1.jpg" style = "width:1100px; height:300px">
</figure>

## 1. Overview
### 1.1. Motivation
The stock market might seem chaotic at first glance. A lot of information must be taken into account when deciding whether or not to buy a stock. Some say that Fundamental Analysis\[1] is the right way to make that decision, while others favour Technical Analysis\[2].

Given this lack of consensus about which method to use, and given the large amount of data available, a data science project is a good fit for predicting how prices will behave on the next day. These predictions could then be used to guide investment decisions.

Nowadays, despite the coronavirus crisis, the US markets are at their all-time highs. This means there is little room left to find shares with good profit potential. To pursue higher returns we must therefore take on more risk, and the investments that fit this profile are those in emerging markets such as Brazil, India and China. I’m a chemical engineer from Brazil, and using data science to understand market behaviour over the coming weeks is a topic that interests me a lot and brings together my education, my experience living here and my data science skills.

A commodity is an economic good that is often used as an input in the production of other goods or services. These goods have full or substantial fungibility, that is, they can be treated as equivalent (or nearly so) regardless of who produced them. Some examples are coffee, gold ore, iron ore, oil, water and electric power. Brazil has a commodity-based economy\[3], and being able to predict the share prices of businesses that work with commodities at some level could be an excellent way to tell good calls from bad ones.

Besides that, there is currently a lot of discussion about a commodity rally in 2021\[4]\[5]\[6]\[7]. Understanding the future of these goods and being able to predict the share prices of these companies, or even the price of a commodity itself such as gold or silver, is therefore key to investing successfully in Brazil.

So let’s dive into the stock prices of Brazilian commodity-based companies, explore them and predict their prices!


### 1.2. Problem Statement
As mentioned above, our interest lies in the future share prices of commodity-based companies. The question becomes even more pressing with 2021 bringing the potential for a commodity rally. Predicting those stock prices would be a big prize.
**Based on this, our problem is to analyse the 5 biggest commodity-based companies in Brazil in order to understand their behaviour and relationships, and to get a deeper understanding of what we are going to predict. Then an algorithm and its pipeline will be designed and implemented to predict future prices.** 


### 1.3. Data and Inputs  
The data will be gathered using the Yahoo! Finance API. A great tutorial (unfortunately only in Portuguese) can be found in this [Medium post](https://medium.com/@rodrigobercinimartins/como-extrair-dados-da-bovespa-sem-gastar-nada-com-python-14a03454a720). The yahooquery library gives us all the tools we need to collect the data.  

We will gather data for the 5 biggest commodity-based companies in Brazil, which are:
- Petrobras (PETR4): the largest Brazilian company, Petrobras produces oil;
- Vale (VALE3): among the largest companies in Brazil, and the largest producer of iron ore in the world;
- CSN (CSNA3): the largest steel company in Brazil and Latin America, and one of the largest in the world;
- JBS (JBSS3): one of the largest producers of animal protein in the world;
- Suzano (SUZB3): considered the largest producer of eucalyptus pulp in the world.

Each daily record will have 7 features: the low, high, open, close and adjusted close prices, as well as the volume and the ticker.

### 1.4. General Outline   
- a. Loading and creating the data
  - a.1. Loading the data from Yahoo




In [9]:
# loading packages and utilities
import numpy as np
import pandas as pd
import tensorflow as tf
import datetime

from tensorflow import keras
from yahooquery import Ticker


In [230]:
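def gather_stocks(company_name_list, start='2007-01-01', end=None):
    """Download the price history of the given companies as a single dataframe.

    Tickers that do not already end in ".SA" get the suffix appended.
    `end` defaults to today's date.
    """
    if end is None:
        # resolve "today" at call time rather than when the function is defined
        end = datetime.date.today()

    def name_check(name):
        # append the suffix only when it is missing
        if not name.endswith(".SA"):
            name = name + ".SA"
        return name
    company_name_list = list(map(name_check, company_name_list))

    # request all tickers at once and flatten the index into columns
    df = pd.DataFrame(Ticker(company_name_list).history(start=start, end=end)).reset_index()
    return df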
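
The `reset_index()` call in `gather_stocks` is easier to follow with a small offline example. The sketch below assumes that the history of several tickers comes back indexed by `(symbol, date)`, which is consistent with the `symbol` and `date` columns in the output further down; the frame and its values are made up for illustration and no request is sent to Yahoo.

```python
import pandas as pd

# Synthetic stand-in for a multi-ticker history result (values are made up).
index = pd.MultiIndex.from_product(
    [["PETR4.SA", "VALE3.SA"], ["2007-01-02", "2007-01-03"]],
    names=["symbol", "date"],
)
history = pd.DataFrame({"close": [25.0, 24.5, 50.0, 51.0]}, index=index)

# reset_index() moves both index levels into ordinary columns,
# giving one row per (symbol, date) pair.
flat = history.reset_index()
print(flat.columns.tolist())
```

Keeping the data in this long format, with one row per ticker and date, makes it straightforward to filter or group by `symbol` later.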


### a.1. Loading the data through the Yahoo! Finance API

In [244]:
# listing the companies and gathering the data
companies = ["PETR4.SA", "VALE3.SA", "CSNA3.SA", "JBSS3.SA", "SUZB3.SA"]
start_date = '2007-01-01'  # one year before the commodities rally in Brazil
end_date = datetime.date.today()  # today
df = gather_stocks(companies, start = start_date, end = end_date)

In [248]:
df.head()

|   | symbol | date | high | close | open | volume | low | adjclose | dividends | splits |
|---|--------|------|------|-------|------|--------|-----|----------|-----------|--------|
| 0 | PETR4.SA | 2007-01-02 | 25.225 | 25.16 | 25.0 | 10244800.0 | 24.879999 | 18.318081 | 0.225 | 0.0 |
| 1 | PETR4.SA | 2007-01-03 | 25.200001 | 24.395 | 25.08 | 19898600.0 | 24.004999 | 17.761106 | 0.0 | 0.0 |
| 2 | PETR4.SA | 2007-01-04 | 24.375 | 23.85 | 24.25 | 21060200.0 | 23.700001 | 17.364315 | 0.0 | 0.0 |
| 3 | PETR4.SA | 2007-01-05 | 23.995001 | 23.125 | 23.6 | 24864000.0 | 22.549999 | 16.836473 | 0.0 | 0.0 |
| 4 | PETR4.SA | 2007-01-08 | 23.57 | 23.395 | 23.25 | 19440200.0 | 22.9 | 17.033047 | 0.0 | 0.0 |


In [245]:
# checking for missing values
df.isnull().sum()

symbol       0
date         0
high         0
close        0
open         0
volume       0
low          0
adjclose     0
dividends    0
splits       0
dtype: int64

In [246]:
# checking the dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17263 entries, 0 to 17262
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   symbol     17263 non-null  object 
 1   date       17263 non-null  object 
 2   high       17263 non-null  float64
 3   close      17263 non-null  float64
 4   open       17263 non-null  float64
 5   volume     17263 non-null  float64
 6   low        17263 non-null  float64
 7   adjclose   17263 non-null  float64
 8   dividends  17263 non-null  float64
 9   splits     17263 non-null  float64
dtypes: float64(8), object(2)
memory usage: 1.3+ MB
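
The `info()` output above lists `date` with dtype `object`, so the dates are not yet stored as pandas datetimes. A minimal sketch of one way to convert and order them is shown below. It uses a small stand-in frame (`sample` is a hypothetical name; the two PETR4.SA closes are taken from the `head()` output and the VALE3.SA value is made up), and the same two calls should apply to `df`.

```python
import pandas as pd

# Small stand-in frame; `date` starts out as plain strings (dtype object).
sample = pd.DataFrame({
    "symbol": ["PETR4.SA", "PETR4.SA", "VALE3.SA"],
    "date": ["2007-01-03", "2007-01-02", "2007-01-02"],
    "close": [24.395, 25.16, 50.0],  # VALE3.SA value is illustrative only
})

# Convert to datetime64 so that date arithmetic and resampling work.
sample["date"] = pd.to_datetime(sample["date"])

# Order chronologically within each ticker.
sample = sample.sort_values(["symbol", "date"]).reset_index(drop=True)
print(sample.dtypes)
```

Sorting by ticker and date matters for time series work, because lagged features and train/test splits both rely on rows being in chronological order.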


In [247]:
df.date.unique().shape

(3496,)

There are 3496 unique trading dates and 17263 rows in total across the 5 companies. That is fewer than 5 × 3496 = 17480, so not every company has a record for every date. The dataset has 10 columns and no missing values.
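
A per-ticker row count is a quick way to see how the rows are distributed and whether each ticker covers every date. The sketch below uses a toy frame with hypothetical tickers `AAA` and `BBB`; the same `groupby` calls could be run on `df`.

```python
import pandas as pd

# Toy frame: ticker BBB starts one day later than ticker AAA.
toy = pd.DataFrame({
    "symbol": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "date": ["2007-01-02", "2007-01-03", "2007-01-04",
             "2007-01-03", "2007-01-04"],
})

rows_per_symbol = toy.groupby("symbol").size()        # rows for each ticker
first_date = toy.groupby("symbol")["date"].min()      # earliest date per ticker
n_dates = toy["date"].nunique()                       # unique dates overall

print(rows_per_symbol)
print(first_date)
print(n_dates)
```

If the counts differ between tickers, the total number of rows will be smaller than the number of unique dates times the number of tickers, and any model that aligns the series by date needs to handle the gaps.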