# Exploring the Bovespa Index (IBOV)
## An analysis of the prices and returns of the largest Brazilian stock index

### Introduction

The Bovespa Index (Portuguese: Índice Bovespa), best known as Ibovespa, is the benchmark index of about 73 stocks traded on B3. It is the most important indicator of the Brazilian stock market, designed to summarize in a single number the general behavior of the main shares traded, which makes it easier to monitor and report their average profitability.

For this purpose, a theoretical stock portfolio is defined, composed of the companies that accounted for the majority of the financial volume traded on B3 (the exchange formed by the merger of Cetip and BM&FBovespa) during a given period.

The Ibovespa simulates the performance of an investment of funds in this theoretical portfolio, in a scenario in which there are no additional contributions or redemptions, and in which all dividends are reinvested.

<img src='https://i.imgur.com/Li58FWq.png' style='width:1600px;height:350px'/>

**Figure 1**: Ibovespa Index. Source: Google

#### ***Please let me know what you think about this kernel. If you find it useful, an upvote would be much appreciated! :)***


### Approach

My focus here is on visualizations that help us better understand this incredible index and answer the following questions:

 - How has the Ibovespa grown over the years?
 - How did prices behave during important local and international events?
 - Are the returns normally distributed?
 - What is the average annual return?
 - In which years did the IBOV perform best and worst?
 - In which months does the IBOV perform best and worst?
 - On which days of the week does the IBOV perform best and worst?

### Loading Required Libraries, Functions and Datasets

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
from statsmodels.graphics.gofplots import qqplot
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()

In [None]:
def moving_average(dados, p):
    '''
    Calculate a simple moving average.

    Parameters:
        dados: the column with numerical prices or asset returns
        p: the number of periods in the averaging window

    The first p-1 values have no full window and are filled with 0.
    '''
    media = dados.rolling(p).mean()
    media = media.fillna(0)
    return media
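
As a quick check of what `moving_average` returns, the sketch below applies the same rolling mean to a toy series (the values are made up for illustration). The first `p - 1` positions have no full window, so `fillna(0)` turns them into zeros rather than real averages, which is worth remembering when reading the start of any plot built from it.

```python
import pandas as pd

# toy prices, invented for illustration only
toy = pd.Series([10.0, 12.0, 14.0, 16.0, 18.0])

# same logic as moving_average(toy, 3)
rolled = toy.rolling(3).mean()
filled = rolled.fillna(0)

print(rolled.tolist())  # the first two entries are NaN
print(filled.tolist())  # [0.0, 0.0, 12.0, 14.0, 16.0]
```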

In [None]:
def remove_outliers(data):
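    '''
    Treat outliers in a pandas.core.series.Series object using
    the boxplot IQR method. Values beyond the fences
    (Q1 - 1.5*IQR and Q3 + 1.5*IQR) are replaced by the fence
    value, so no observation is dropped.

    Parameters:
        data: a pandas.core.series.Series object with
        outliers to be treated

    Returns:
        a numpy.ndarray with the capped values
    '''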
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    data = np.where(data>(Q3+1.5*IQR),(Q3+1.5*IQR),data)
    data = np.where(data<(Q1-1.5*IQR),(Q1-1.5*IQR),data)
    return data
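
To make the behavior of `remove_outliers` concrete, here is a minimal sketch on made-up returns. It uses `Series.clip` as a stand-in for the two `np.where` calls (an equivalent I am assuming for illustration, not the function itself) and shows that extreme values are capped at the fence instead of being dropped.

```python
import numpy as np
import pandas as pd

# toy returns with one extreme value, invented for illustration
toy = pd.Series([-0.02, -0.01, 0.0, 0.01, 0.02, 0.30])

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# values outside the fences are pulled back to the fence value
clipped = toy.clip(lower=lower, upper=upper)

print(len(toy), len(clipped))            # same length: nothing is dropped
print(np.isclose(clipped.max(), upper))  # the extreme value now sits on the upper fence
```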

In [None]:
# loading dataset and seeing first rows
ibov = pd.read_csv('../input/ibovespaibov-historical-data-from-1992-to-2019/ibov_data.csv')
ibov.head()

We have a column with dates. To make the plots easier to interpret, let's convert it to datetime objects.

In [None]:
# converting the date column to datetime format (day/month/year)
ibov['date'] = pd.to_datetime(ibov['date'], format='%d/%m/%Y')

### Exploratory Data Analysis

#### How did the Bovespa index react to important moments?

Since 1992 the world has suffered many devastating crises, and all of them left their mark on the Bovespa index, with falls greater than 50% in episodes such as the 2008 financial crisis and the dot-com bubble. Each time, though, the index recovered relatively quickly compared with the longest bear market it has been through: **"The Dilma era"**.

That bear market lasted almost 6 years. The index only started to grow again with the rumors of impeachment, and finally the impeachment itself, in 2016.
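
The size of falls like these can be measured as a drawdown: the percentage distance from the highest close seen so far. A minimal sketch on made-up prices (with the real data, `close` would presumably be `ibov['close']`):

```python
import pandas as pd

# toy closing prices, invented for illustration
close = pd.Series([100.0, 120.0, 90.0, 60.0, 80.0, 130.0])

running_peak = close.cummax()        # highest close seen so far
drawdown = close / running_peak - 1  # 0 at a new peak, negative below it

print(drawdown.round(2).tolist())
print("max drawdown:", drawdown.min())  # -0.5 here: the fall from 120 to 60
```

The length of a bear market can be read from the same series as the number of consecutive periods in which `drawdown` stays below zero.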

In [None]:
# defining the plot size
plt.figure(figsize=(30,15))

# creating the mainplot with prices and dates
sns.lineplot(x=ibov['date']
             , y=np.log(ibov['close'])
             , color='blue')

# defining title, labels and ticks parameters
plt.title('How did the Bovespa index react to important economic and political events?', size=40)
plt.xlabel('Years', size=25)
plt.ylabel('Log of close price', size=25)
plt.xticks(size=20)
plt.yticks(size=20)

# adding vertical lines on plot
plt.axvline("1992-05-20", color='black'
            , linestyle="-"
            , linewidth=1
            , label='Black Wednesday')

plt.axvline("1994-08-25", color='darkgreen'
            , linestyle="-"
            , linewidth=1
            , label='The Mexican peso crisis')

plt.axvline("1997-07-01", color='darkred'
            , linestyle="-"
            , linewidth=1
            , label='Asian and Russian financial crises')

plt.axvline("2000-03-10", color='lightblue'
            , linestyle="-"
            , linewidth=1
            , label='dot-com bubble')

plt.axvline("2004-01-01", color='yellow'
            , linestyle="-"
            , linewidth=1
            , label='Commodities boom')

plt.axvline("2008-04-24", color='purple'
            , linestyle="-"
            , linewidth=1
            , label='2008 financial crisis')

plt.axvline("2008-11-01", color='green'
            , linestyle="-"
            , linewidth=1
            , label='End of the Crisis')

plt.axvline("2010-01-01", color='red'
            , linestyle="-"
            , linewidth=1
            , label='Beginning of the Dilma era')

plt.axvline("2016-08-31", color='orange'
            , linestyle="-"
            , linewidth=1
            , label='Impeachment')

# setting parameters to events text
setup = dict(size = 20, color = "black")

# setting text with events
plt.text("1990-02-01", np.log(4000)
         , "Black Wednesday"
         , **setup)

plt.text("1992-07-01", np.log(7000)
         , "The Mexican peso crisis"
         , **setup)

plt.text("1994-07-01", np.log(16000)
         , "Asian and Russian financial crises"
         , **setup)

plt.text("2000-03-10", np.log(20000)
         , "dot-com bubble"
         , **setup)

plt.text("2004-01-01", np.log(16000)
         , "Commodities boom"
         , **setup)

plt.text("2004-03-16", np.log(80000)
         , "2008 financial crisis"
         , **setup)

plt.text("2008-11-01", np.log(27000)
         , "End of the Crisis"
         , **setup)

plt.text("2010-01-01", np.log(75000)
         , "Beginning of the Dilma era"
         , **setup)

plt.text("2016-08-31", np.log(35000)
         , "Impeachment"
         , **setup)

# setting legend parameters
plt.legend(loc = "upper left"
           , frameon = True
           , fontsize = 18
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

#### Daily returns vs Average monthly returns

Let's look at the daily returns and compare them with their 30-day moving average (labelled "monthly average" below) to see how large falls and rises are smoothed out once we average. The message seems to be: ***Focus on the long term, forget daily variations!***

In [None]:
# creating a new column with returns (measured from open to close)
ibov['returns'] = ibov['close']/ibov['open']-1

# creating a new column with the 30-period moving average of returns
ibov['monthly_average_returns'] = moving_average(ibov['returns'],30)

ibov.head()
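
The `returns` column above measures each session from open to close. A common alternative is the close-to-close return, which also captures the overnight gap. The sketch below contrasts the two on made-up prices. The column names `intraday` and `close_to_close` are mine, not part of the dataset.

```python
import pandas as pd

# toy open/close prices, invented for illustration
toy = pd.DataFrame({
    "open":  [100.0, 103.0, 101.0],
    "close": [102.0, 101.0, 104.0],
})

# open-to-close return, as used in this notebook
toy["intraday"] = toy["close"] / toy["open"] - 1

# close-to-close return; the first row has no previous close, so it is NaN
toy["close_to_close"] = toy["close"].pct_change()

print(toy)
```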

In [None]:
# defining the plot size
plt.figure(figsize=(20,10))

# creating the plot of returns
sns.lineplot(x='date'
             ,y='returns'
             ,data=ibov
             ,label='daily returns'
            ,color='blue')

# creating the plot of average returns
sns.lineplot(x='date'
             ,y='monthly_average_returns'
             ,data=ibov
             ,label='monthly average returns'
            ,color='orange')

# setting title, labels and ticks parameters
plt.title('Bovespa index returns over the years', size=30)
plt.xlabel('Years', size=20)
plt.ylabel('Returns', size=20)
plt.xticks(size=15)
plt.yticks(size=15)

# setting legend parameters
plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 18
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

#### Are returns normally distributed?

**No.** A quick look at the histogram can be misleading because of its pretty bell-shaped curve, but some returns are above 0.2, which probably makes the distribution right-skewed (positively skewed). Let's use a Q-Q plot to confirm our suspicion.

In [None]:
plt.figure(figsize=(16,10))
# histplot replaces the deprecated distplot; kde=True keeps the density curve
sns.histplot(x=ibov['returns']
             , label = 'Distribution'
             , bins=200
             , kde=True
             , stat='density')

plt.title('Bovespa index returns distribution', size=30)
plt.xlabel('Returns', size=20)
plt.ylabel('Density', size=20)
plt.xticks(size=15)
plt.yticks(size=15)

plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 18
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

As we suspected, the returns are right-skewed, so it is advisable to treat the data before modeling or making predictions.

In [None]:
# creating a qqplot
qqplot(ibov['returns'],line='s')

plt.title('Probability distribution',size=15)
plt.xlabel('Theoretical Quantiles',size=12)
plt.ylabel('Sample Quantiles',size=12)
plt.xticks(size=10)
plt.yticks(size=10);
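
A Q-Q plot is a visual check, and it can be backed up with numbers. The sketch below is one possible way to do that, assuming `scipy` is available (it is a dependency of `statsmodels`). It computes skewness, excess kurtosis and a Jarque-Bera test on synthetic samples, not on the real IBOV data. With the notebook's data, the input would be `ibov['returns']`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# synthetic samples standing in for returns (not the real IBOV data)
normal_like = rng.normal(loc=0.0, scale=0.02, size=5000)
skewed = rng.lognormal(mean=0.0, sigma=0.5, size=5000) - 1.0

for name, sample in [("normal-like", normal_like), ("right-skewed", skewed)]:
    skew = stats.skew(sample)                       # > 0 means a longer right tail
    excess_kurt = stats.kurtosis(sample)            # 0 for a normal distribution
    jb_stat, jb_pvalue = stats.jarque_bera(sample)  # H0: the sample is normal
    print(f"{name}: skew={skew:.2f}, excess kurtosis={excess_kurt:.2f}, JB p-value={jb_pvalue:.4f}")
```

A small p-value rejects normality. Positive skewness points to a right tail like the one suspected above.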

To better understand these outliers we can look at a boxplot, where we can see that most of them lie below -0.1 and above 0.1. To make the returns simpler to model and predict, we can replace the outliers using the boxplot IQR method.

In [None]:
plt.figure(figsize=(20,10))
sns.boxplot(x=ibov['returns'], color='orange')

plt.title('Bovespa returns distribution and outliers', size=30)
plt.xlabel('Returns', size=20)
plt.xticks(np.arange(-0.2,0.4,step=0.05), size=15);

#### Treating outliers
Let's see what our returns look like without outliers...

In [None]:
# replacing outliers with the IQR fence values
no_out_returns = remove_outliers(ibov['returns'])

# defining the plot size
plt.figure(figsize=(10,10))

# creating the plot
sns.boxplot(y=no_out_returns, color='green')

plt.title('Bovespa returns distribution without outliers', size=25)
plt.ylabel('Returns', size=20)
plt.yticks(np.arange(-0.05,0.06,step=0.025), size=15);

Now let's look at the Q-Q plot again... BAM! Much more interesting! We have restricted the range of the returns, which can be extremely useful for predictions.

In [None]:
# creating a qqplot
qqplot(no_out_returns,line='s')

plt.title('Probability distribution',size=15)
plt.xlabel('Theoretical Quantiles',size=12)
plt.ylabel('Sample Quantiles',size=12)
plt.xticks(size=10)
plt.yticks(size=10);

#### Accumulated returns and average returns

With an average annual performance of 24.63%, accumulated returns greater than 100% in some years and losses of more than 30% in others, let's see how the Bovespa index performed year by year, comparing each year's accumulated returns with the average annual return.

In [None]:
print("Average annual returns:",round(ibov['returns'].groupby(ibov['year']).sum().mean()*100,2),"%")
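
One caveat when reading these numbers: the yearly figures are sums of daily returns, which is not the same as the compounded return an investor would earn over the year. A small sketch on made-up returns shows how far apart the two can be:

```python
import pandas as pd

# toy daily returns for one "year", invented for illustration
r = pd.Series([0.50, -0.50, 0.10])

summed = r.sum()                 # what a groupby(...).sum() of returns measures
compounded = (1 + r).prod() - 1  # growth of 1 unit invested over the same days

print(f"sum of returns: {summed:.2%}")
print(f"compounded:     {compounded:.2%}")
```

Here the sum is positive while the compounded result is a loss, so summed returns are best read as a rough indicator rather than a realized gain.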

In [None]:
returns_by_year = pd.DataFrame(round(ibov['returns'].groupby(ibov['year']).sum()*100,2))
returns_by_year['pos_neg'] = returns_by_year['returns']>0

plt.figure(figsize=(20,10))
returns_by_year.returns.plot(kind='bar'
                      , color=returns_by_year.pos_neg.map({True: 'forestgreen', False: 'red'})
                      , label='returns')

average_returns = np.mean(returns_by_year['returns'])

plt.axhline(y=average_returns, color='blue', linestyle='-', label='Average returns')

plt.title('Accumulated returns by year vs average returns',size=30)
plt.xlabel('Years',size=20)
plt.ylabel('Accumulated in %',size=20)
plt.yticks(np.arange(-40,170,step=10), size=15)
plt.xticks(rotation=45, size=15)

plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 18
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

In [None]:
# counting positive and negative years in a fixed order, so the labels always match
data = returns_by_year['pos_neg'].value_counts().reindex([True, False], fill_value=0).tolist()
pos_neg = pd.DataFrame(data=data,columns=['counts'],index=['positive','negative'])

# defining plot size
plt.figure(figsize=(10,10))

# pie chart with the share of positive and negative years
pos_neg['counts'].plot(kind='pie'
                       , colors=['lightgreen','red']
                       , autopct='%1.1f%%' # adding percentages
                       , shadow=True
                       , startangle=140)

# defining title and legend parameters
plt.title("What's the proportion of positive annual returns?",size=20)
plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 10
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

#### Accumulated returns by month

January has accumulated the largest gains over the years, while May and June are the only months with negative accumulated returns. August and October have not had good gains either.

In [None]:
returns_by_month = pd.DataFrame(round(ibov['returns'].groupby(ibov['month']).sum()*100,2))
returns_by_month['pos_neg'] = returns_by_month['returns']>0

plt.figure(figsize=(20,10))
returns_by_month.returns.plot(kind='bar'
                      , color=returns_by_month.pos_neg.map({True: 'forestgreen', False: 'red'}))

plt.title('Best months by accumulated returns in %',size=30)
plt.xlabel('Months',size=20)
plt.ylabel('Accumulated in %',size=20)
plt.yticks(np.arange(-20,170,step=10), size=15)
plt.xticks(rotation=45, size=15);

In [None]:
# joining year and month in a new variable month_year (year first, so it sorts chronologically)
ibov["month_year"] = ibov["date"].dt.strftime('%Y-%m')

# calculating returns for each individual month
returns_by_month = pd.DataFrame(round(ibov['returns'].groupby(ibov['month_year']).sum()*100,2))
returns_by_month['pos_neg'] = returns_by_month['returns']>0
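
A note on the grouping key: `groupby` sorts text keys alphabetically, which only matches calendar order when the key is written year-first and zero-padded. One alternative, sketched here on made-up data, is to group by a pandas monthly `Period`, which always sorts chronologically:

```python
import pandas as pd

# toy daily returns spanning a year boundary, invented for illustration
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-11-30", "2018-12-03", "2018-12-04", "2019-01-02"]),
    "returns": [0.01, -0.02, 0.03, 0.01],
})

# a monthly Period key keeps the groups in calendar order
monthly = toy.groupby(toy["date"].dt.to_period("M"))["returns"].sum()

print(monthly)
```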

In [None]:
plt.figure(figsize=(30,10))
returns_by_month.returns.plot(kind='bar'
                      , color=returns_by_month.pos_neg.map({True: 'forestgreen', False: 'red'})
                      , label='returns')

average_returns = np.mean(returns_by_month['returns'])

plt.axhline(y=average_returns, color='blue', linestyle='-', label='Average returns')

plt.title('Returns by month vs average monthly returns',size=30)
plt.xlabel('Months',size=20)
plt.ylabel('Accumulated in %',size=20)
plt.yticks(np.arange(-60,80,step=10), size=15)
plt.xticks(rotation=45, size=0)

plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 18
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

In [None]:
# counting positive and negative months in a fixed order, so the labels always match
data2 = returns_by_month['pos_neg'].value_counts().reindex([True, False], fill_value=0).tolist()
pos_neg2 = pd.DataFrame(data=data2,columns=['counts'],index=['positive','negative'])

# defining plot size
plt.figure(figsize=(10,10))

# pie chart with the share of positive and negative months
pos_neg2['counts'].plot(kind='pie'
                       , colors=['lightgreen','red']
                       , autopct='%1.1f%%' # adding percentages
                       , shadow=True
                       , startangle=140)

# defining title and legend parameters
plt.title("What's the proportion of positive monthly returns?",size=20)
plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 10
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);

#### Best days of the week

In [None]:
returns_by_week_day = pd.DataFrame(round(ibov['returns'].groupby(ibov['day_of_week']).sum()*100,2))
returns_by_week_day['pos_neg'] = returns_by_week_day['returns']>0

plt.figure(figsize=(20,10))
returns_by_week_day.returns.plot(kind='bar'
                      , color=returns_by_week_day.pos_neg.map({True: 'forestgreen', False: 'red'}))

plt.title('Best days of week by accumulated returns %',size=30)
plt.xlabel('Days',size=20)
plt.ylabel('Accumulated in %',size=20)
plt.yticks(np.arange(-70,350,step=20), size=15)
plt.xticks(rotation=45, size=15);
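
Summed returns per weekday depend on how many sessions fell on each day, so it can help to look at the mean and the count next to the sum. A minimal sketch on made-up data follows. The `weekday` column is derived from the date here, which is an assumption on my part and may differ from how the dataset's `day_of_week` column was built.

```python
import pandas as pd

# toy daily returns, invented for illustration
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-07", "2019-01-08", "2019-01-11", "2019-01-14", "2019-01-18"]),
    "returns": [-0.010, 0.004, 0.012, -0.002, 0.006],
})

toy["weekday"] = toy["date"].dt.day_name()

# sum, mean and number of observations per weekday
by_day = toy.groupby("weekday")["returns"].agg(["sum", "mean", "count"])
print(by_day)
```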

#### Daily returns proportion

In [None]:
# calculating positive and negative daily returns
daily_returns = pd.DataFrame(round(ibov['returns'].groupby(ibov['date']).sum()*100,2))
daily_returns['pos_neg'] = daily_returns['returns']>0

# counting positive and negative days in a fixed order, so the labels always match
data3 = daily_returns['pos_neg'].value_counts().reindex([True, False], fill_value=0).tolist()
pos_neg3 = pd.DataFrame(data=data3,columns=['counts'],index=['positive','negative'])

# defining plot size
plt.figure(figsize=(10,10))

# pie chart with the share of positive and negative days
pos_neg3['counts'].plot(kind='pie'
                       , colors=['lightgreen','red']
                       , autopct='%1.1f%%' # adding percentages
                       , shadow=True
                       , startangle=140)

# defining title and legend parameters
plt.title("What's the proportion of positive daily returns?",size=20)
plt.legend(loc = "upper right"
           , frameon = True
           , fontsize = 10
           , ncol = 2 
           , fancybox = True
           , framealpha = 0.95
           , shadow = True
           , borderpad = 1);
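
The proportions shown in the pie charts can also be computed directly, since the mean of a boolean series is the share of `True` values. The sketch below does this at two horizons on made-up returns. It illustrates the idea only and does not reproduce the notebook's figures.

```python
import pandas as pd

# toy daily returns over two months, invented for illustration
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-02", "2019-01-03", "2019-01-04",
                            "2019-02-01", "2019-02-04", "2019-02-05"]),
    "returns": [0.03, -0.01, -0.01, 0.02, -0.005, -0.005],
})

# share of positive days
share_positive_days = (toy["returns"] > 0).mean()

# share of positive months, using summed returns per month
monthly = toy.groupby(toy["date"].dt.to_period("M"))["returns"].sum()
share_positive_months = (monthly > 0).mean()

print(share_positive_days, share_positive_months)
```

In this toy case only a third of the days are positive, yet both months end up positive.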

### Conclusions

The Bovespa index has been through many difficult periods, with national and international economic crises. Its longest bear market was the "Dilma era", from 2010 to 2016.

The daily returns are not normally distributed, and dealing with the skewness is desirable before modeling. One good approach is to use the boxplot IQR method to replace outliers.

The best year for the Bovespa index in terms of performance was 1993 and the worst was 2008. The longest downturn started in 2010 and ended in 2016, with the end of the "Dilma era".

January is the best month for the Bovespa index, with more than 150% of accumulated returns, and the worst is June, with around -8%.

Friday is the best day for the Bovespa index and Monday the worst.

Over the long term we tend to see more positive returns: 67.9% of annual returns are positive, against 62.1% of monthly returns and only 51.6% of daily returns. Trading on daily returns without an incredible strategy can be like being in a casino in Las Vegas.

### Thank you very much for reading this kernel! Let me know what you think in the comments, and if you found it useful, please leave an upvote. I really appreciate it :)
<img src='https://i.imgur.com/2Ft7VUy.gif' style='width:350px;height:300px'/>

**Figure 2**: HOLD/SELL roulette. Source: Google