# Primary analysis and preprocessing of data with Pandas

In this task, we will perform a primary analysis of a real dataset about chocolate bars.

The data contains attributes:
* 'company' - manufacturer company
* 'bar_name' - name of the chocolate bars
* 'ref' - reference number of the review
* 'rew_date' - evaluation date
* 'percent' - percentage of cocoa beans
* 'company_loc' - company's location
* 'rate' - rating
* 'bean_dtype' - type of cocoa beans
* 'bean_orig' - country of origin.

## Numerical Attributes

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import seaborn as sns
%matplotlib inline
sns.set(style='darkgrid')

Read the data from the CSV file. The file's first row contains the original headers, which we replace with shorter column names.

In [None]:
best_bar = pd.read_csv('flavors_of_cacao.csv', sep=',', header=0,
                       names=['company', 'bar_name', 'ref', 'rew_date', 'percent',
                              'company_loc', 'rate', 'bean_dtype', 'bean_orig'])
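
To see what `header=0` combined with `names` does, here is a minimal sketch on a made-up two-column CSV held in memory. The sample text, the maker names, and the column names are assumptions for illustration only, not part of the real file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV whose first row is a header we want to replace.
csv_text = "Company Name,Cocoa Percent\nMaker A,63%\nMaker B,70%\n"

# header=0 marks the first row as the header, so it is not read as data;
# names= then overrides those labels with our own.
sample = pd.read_csv(io.StringIO(csv_text), sep=',', header=0,
                     names=['company', 'percent'])
print(sample)
```

Without `header=0`, passing `names` alone would make pandas treat the original header row as a data row.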

In [None]:
type(best_bar)

Keep only the company's location, rating, percentage of cocoa beans, and country of origin for further analysis.

In [None]:
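# .copy() makes an independent frame, so later column assignments do not raise SettingWithCopyWarning
best_bar = best_bar[['company_loc', 'rate', 'percent', 'bean_orig']].copy()
best_bar.head()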

Plot a histogram of the bar ratings with the *.plot()* method, using 20 bins (the *bins* parameter).
A histogram divides the variable's range into bins, counts the data points in each bin, and shows the bins on the x-axis and the counts on the y-axis.

Also plot the probability density function, which is an estimate of the underlying continuous probability distribution. The estimate is built as a combination of kernels, i.e. simpler distributions such as the normal (Gaussian) one, which is why it is called kernel density estimation (KDE). You can draw a KDE plot using the *.plot()* method with the *kind='kde'* parameter.

In [None]:
# code here
best_bar.rate.plot(kind='hist', bins=20, title="Histogram of distribution of bar's rating")
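
The two ideas described above, bin counting and a density built from Gaussian kernels, can be sketched by hand with NumPy. The ratings below are made up, and both the helper `gaussian_kde_sketch` and the bandwidth of 0.3 are illustrative assumptions. They are not what pandas does internally: pandas delegates KDE to SciPy and picks the bandwidth automatically.

```python
import numpy as np

# Made-up ratings, used only to illustrate the two ideas.
ratings = np.array([1.0, 2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 3.75, 4.0, 5.0])

# Histogram: split the range [1, 5] into 4 equal bins and count points per bin.
counts, edges = np.histogram(ratings, bins=4)
print(edges)   # bin boundaries
print(counts)  # number of ratings in each bin

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps, one centered on each sample."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

grid = np.linspace(-2.0, 8.0, 1001)
density = gaussian_kde_sketch(ratings, grid, bandwidth=0.3)  # arbitrary bandwidth

# A density should integrate to about 1 (simple Riemann sum over the grid).
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth makes the curve follow individual points more closely, while a larger one smooths it out.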

Plot a distribution histogram for the percentage of cocoa beans. 
You need to preprocess this feature because it has the object type (strings). 

In [None]:
# code here


Convert this attribute to float. First remove the '%' character at the end of each value in this column, then convert the result to the float data type. You can use *.apply* on the column to pass every single value to a function.
Since '%' is the last character of each string value, we take all characters except the last one with the slice [:-1].

best_bar['percent'] = best_bar['percent'].apply(lambda x: x[:-1])

After removing '%', convert the string values to float.

best_bar['percent'] = best_bar['percent'].astype(float)
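
As an alternative sketch, the same conversion can be written with the vectorized string accessor instead of *.apply*. The three values below are made up and only mimic the format of the 'percent' column.

```python
import pandas as pd

# Made-up values in the same 'NN%' string format as the 'percent' column.
percent = pd.Series(['63%', '70%', '72.5%'])

# Strip the trailing '%' from every string, then cast the column to float.
percent_float = percent.str.rstrip('%').astype(float)
print(percent_float.tolist())
```

Unlike the slice [:-1], `.str.rstrip('%')` leaves a value untouched if it happens not to end with '%'.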

In [None]:
# code here


Plot a box-and-whisker diagram for the numerical attributes using the *.boxplot()* function from the seaborn library or the corresponding Pandas method. Also call the *.describe()* function on each distribution.

In [None]:
# code here
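
A small sketch of how *.describe()* relates to a box plot, on made-up percentages. It assumes the common 1.5 * IQR convention for whiskers; the plotting call itself is left out.

```python
import pandas as pd

percent = pd.Series([55.0, 70.0, 70.0, 72.0, 75.0, 100.0])  # made-up values

summary = percent.describe()  # count, mean, std, min, quartiles, max
print(summary)

# A box plot draws the box from the 25% to the 75% quartile; under the
# 1.5 * IQR convention, points beyond the bounds below are drawn as outliers.
q1, q3 = summary['25%'], summary['75%']
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = percent[(percent < lower) | (percent > upper)]
print(outliers.tolist())
```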


## Categorical Attributes

Now let's turn to the categorical attributes. A distribution histogram does not apply to them, but you can use the *.value_counts()* function, which returns an object containing the counts of unique values in the attribute.

Pay attention to the countries of origin that occur rarely in the dataset.
The *.head()* and *.tail()* functions output the first and last elements, respectively.

In [None]:
# code here
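
A minimal sketch of *.value_counts()* with *.head()* and *.tail()*, on a made-up list of origins:

```python
import pandas as pd

origins = pd.Series(['Venezuela'] * 3 + ['Peru'] * 2 + ['Fiji'])  # made-up sample

counts = origins.value_counts()  # sorted from most to least frequent
print(counts.head(2))  # the most frequent values
print(counts.tail(1))  # the rarest value
```

Because the result is sorted by count, *.tail()* is a quick way to spot rarely occurring values.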


Once you have the object containing the counts of unique values, you can visualize their distribution using the *.plot(kind='bar')* function.

In [None]:
# code here
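
A sketch of the bar chart on made-up origins. The `matplotlib.use('Agg')` line is only an assumption to let the snippet run outside a notebook; inside the notebook, where `%matplotlib inline` is active, it should be omitted.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; omit this line in the notebook
import pandas as pd

origins = pd.Series(['Venezuela'] * 3 + ['Peru'] * 2 + ['Fiji'])  # made-up sample

# One bar per unique value, with the bar height equal to its count.
ax = origins.value_counts().plot(kind='bar', title='Bars per country of origin')
ax.set_ylabel('number of bars')
n_bars = len(ax.patches)
print(n_bars)
```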


The bar chart of the countries of origin shows that the data is messy: some countries of origin are misspelled, some objects have missing data, and some chocolate bars are made of cocoa beans from several countries of origin. The data therefore needs to be preprocessed before further analysis.

First, remove the objects with missing data using the *.dropna()* function.

You need to use the *'axis'* parameter, which can take the values:
* 0 : Drop rows which contain missing values.
* 1 : Drop columns which contain missing values.

and the *'how'* parameter, where
* 'any' : If any NA values are present, drop that row or column.
* 'all' : If all values are NA, drop that row or column.

In [None]:
# code here
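
The difference between the two *'how'* options can be sketched on a made-up three-row frame:

```python
import numpy as np
import pandas as pd

# Made-up frame: row 0 is complete, row 1 is fully empty, row 2 is partly empty.
sample = pd.DataFrame({
    'rate': [3.0, np.nan, np.nan],
    'bean_orig': ['Peru', np.nan, 'Fiji'],
})

any_dropped = sample.dropna(axis=0, how='any')  # keeps only fully filled rows
all_dropped = sample.dropna(axis=0, how='all')  # drops only the fully empty row
print(len(any_dropped), len(all_dropped))
```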


Not all empty records of the attribute were deleted. This means that the seemingly empty records actually hold a value that *.dropna()* does not treat as missing.

Print all unique values of the *bean_orig* attribute using the *.unique()* function and find the value that stands for an empty record.

In [None]:
# code here


To exclude this element, you can use the expression best_bar['bean_orig'] != 'Element value', which returns a boolean array,
where
* True - the object's attribute doesn't take the 'element value',
* False - it takes the 'element value'.

The resulting boolean array is then passed as a mask, which tells which objects from our dataframe will be selected.

In [None]:
# code here
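
A minimal sketch of boolean masking on made-up data. The single-space placeholder is an assumption for this example only; the actual placeholder in the dataset has to be found with *.unique()*.

```python
import pandas as pd

# Made-up frame in which "empty" origins are stored as a single space.
sample = pd.DataFrame({'bean_orig': ['Peru', ' ', 'Fiji', ' '],
                       'rate': [3.5, 3.0, 2.75, 4.0]})
print(sample['bean_orig'].unique())  # the placeholder shows up among the values

placeholder = ' '  # assumed for this sketch only
mask = sample['bean_orig'] != placeholder  # True where the row should be kept
cleaned = sample[mask]
print(cleaned)
```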


In addition, we need to deal with bars made of several different cocoa beans. Ideally, these compound values would be parsed and split, but since they contribute little, we'll merge them into one value named 'complex'. We will treat a country of origin as rare if it has no more than five samples in the data.

In [None]:
# code here
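
One possible reading of this step, sketched on made-up origins: every value that occurs no more than `threshold` times is replaced with 'complex'. The variable names and the sample are assumptions.

```python
import pandas as pd

# Made-up sample: one frequent origin, one compound value, one rare origin.
origins = pd.Series(['Peru'] * 6 + ['Peru, Ecuador'] + ['Fiji'] * 2)

threshold = 5  # "no more than five samples" counts as rare
counts = origins.value_counts()
rare = counts[counts <= threshold].index  # labels of the rare values

# Keep a value where it is not rare; otherwise replace it with 'complex'.
grouped = origins.where(~origins.isin(rare), 'complex')
print(grouped.value_counts())
```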


Plot a pie chart (circular statistical diagram) of the company's location using the *.pie()* function.

In [None]:
# code here


## Pairwise distributions

In [7]:
from sklearn import preprocessing
from seaborn import pairplot

Plot pairwise distributions for all attributes.

In [None]:
# code here


Determine where the best cocoa beans grow: calculate the mean and the median of the bar rating for each country of origin. Write down the top three in both cases.

In [None]:
# code here
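
A sketch of the group-and-rank pattern on made-up ratings. The sample values are assumptions, chosen so that the mean and the median give different rankings.

```python
import pandas as pd

sample = pd.DataFrame({
    'bean_orig': ['Peru', 'Peru', 'Haiti', 'Haiti', 'Haiti', 'Fiji', 'Ghana', 'Ghana'],
    'rate':      [4.0,    3.5,    4.0,     4.0,     1.5,     3.0,    2.5,     3.0],
})

# Mean and median rating per country of origin.
stats = sample.groupby('bean_orig')['rate'].agg(['mean', 'median'])

top_mean = stats['mean'].nlargest(3)      # top three by mean rating
top_median = stats['median'].nlargest(3)  # top three by median rating
print(top_mean)
print(top_median)
```

The median is less sensitive to a single very low rating, which is why the two rankings can differ. The same pattern applies to any other grouping column.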


Determine where the best chocolate bars are produced: calculate the mean and the median of the bar rating for each company's location. Write down the top three in both cases.

In [None]:
# code here