# Univariate Analysis

1. Continuous Data Plots:

    * Univariate Scatter Plot (with or without the hue parameter)
        * plt.scatter()
        * sns.scatterplot()
        
        ![](https://miro.medium.com/max/372/1*9QNixJ5eRRe8NjAMjEYFjQ.png)
    
    * Line Plot (with markers) - *It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments.*
        * plt.plot()
        * sns.lineplot()
        
        ![](https://miro.medium.com/max/460/1*3AATuqbjcp3xMiTkuWwNPw.png)
    
    * Strip plot - *The strip plot is similar to a scatter plot. It is often used along with other kinds of plots for better analysis. It is used to visualize the distribution of data points of the variable.* 
        * sns.stripplot(y = df['col1'])
        * sns.stripplot(x = df['col1'] ,y = df['col2'])
        
        ![](https://miro.medium.com/max/332/1*YysiqzshO8EH03-0mQbOYQ.png)
        ![](https://miro.medium.com/max/332/1*OvLxUGVF8HoIB9eN9_rrCw.png)
        
    * Swarm plot - *The swarm-plot, similar to a strip-plot, provides a visualization technique for univariate data to view the spread of values in a continuous variable. The only difference between the strip-plot and the swarm-plot is that the swarm-plot spreads out the data points of the variable automatically to avoid overlap and hence provides a better visual overview of the data.*
        * sns.swarmplot(x = df['col1'])
        * sns.swarmplot(x = df['col1'], y = df['col2'])
        
        ![](https://miro.medium.com/max/298/1*LogSEhTZuhuHgytpbP5tfg.png)
        ![](https://miro.medium.com/max/332/1*5NKuFx60PNUFPDKT9CN-GA.png)
        
    * Histogram - *Histograms are similar to bar charts which display the counts or relative frequencies of values falling in different class intervals or ranges. A histogram displays the shape and spread of continuous sample data.*
        * plt.hist(df['col'])
        * sns.histplot(df['col'], kde=False, color='black', bins=10)
        
        ![](https://miro.medium.com/max/726/1*XUZLqINB4fJskto9CVNQOw.png)
    
    * Density Plot/ KDE Plot - *A density plot is like a smoothed version of a histogram. Generally, a kernel density estimate is used in density plots to show the probability density function of the variable.*
        * df['col'].plot(kind = 'density')
        * sns.kdeplot(df['col'], fill=True) # `shade=True` in seaborn versions before 0.11
        
        ![](https://miro.medium.com/max/356/1*inbubvWs2Wi6cizmLEyIvA.png)
        ![](https://miro.medium.com/max/324/1*wcr6xe6VSjqkbF1al9Oi9Q.png)
        
    * Box Plot - *A box plot is a standardized way of displaying the distribution of data based on a five-number summary (minimum, first quartile, second quartile (median), third quartile, maximum). It helps in understanding these parameters of the distribution and is extremely helpful in detecting outliers.*

        ![](https://miro.medium.com/max/628/1*FPnhYs6cs3ipUKIZhl9caA.png)

        * plt.boxplot(df['col'])
        * sns.boxplot(x = df['col'])
        * sns.boxplot(x = 'variable', y = 'value', data=df) # for multiple boxplots in one plot

        ![](https://miro.medium.com/max/298/1*aiGFPKBEHIQsm5VMN9yDHw.png)
        ![](https://miro.medium.com/max/546/1*0x_qAsQ0ZblqfTgXS9cErg.png)
    
    * Distplot - *The distplot() function of the seaborn library combines the matplotlib hist() function with the seaborn kdeplot() and rugplot() functions. It has been deprecated since seaborn 0.11 in favor of histplot() and displot().*
        * sns.distplot(df['col'], rug=True)
        
        ![](https://miro.medium.com/max/380/1*7xqUshvBswGn88mOS9Mdmw.png)
    
    * Violin Plot - *The violin plot is very similar to a box plot, with the addition of a rotated kernel density plot on each side. It shows the distribution of quantitative data across several levels of one (or more) categorical variables so that those distributions can be compared.*

        ![](https://miro.medium.com/max/241/1*5TYiYasGcxFLK4ThlgIiXA.png)

        * plt.violinplot(df.values, showmedians=True)
        * sns.violinplot(y = df['col'])
        * sns.violinplot(x = 'variety', y = 'petal.width', data = df) # for multiple violinplots in one plot

        ![](https://miro.medium.com/max/332/1*kgzy9FEnfvjQiFUi8uq7Ew.png)
        ![](https://miro.medium.com/max/555/1*6BefklySIxTT4qfbtT1qTg.png)
        
2. Categorical Data Plots:
    * Bar Chart/ Count plot - *The bar plot is a univariate data visualization plot on a two-dimensional axis. One axis is the category axis indicating the category, while the second axis is the value axis that shows the numeric value of that category, indicated by the length of the bar.*
        * df['variety'].value_counts().plot.bar()
        * sns.countplot(x = df['variety'])
        
        ![](https://miro.medium.com/max/315/1*3CfkP0AckHPAOjkXoxCYkw.png)
        ![](https://miro.medium.com/max/329/1*jgV7izuVqOW8cn87pjU3UQ.png)
        
    * Pie chart - *A pie chart is a common way to visualize the numerical proportion occupied by each of the categories.*
        * plt.pie(df['variety'].value_counts(), labels = ['A', 'B', 'C'])
        * plt.pie(df['variety'].value_counts(), labels = ['A', 'B', 'C'], autopct='%.3f') # to show percentages in each section of pie chart
        
        ![](https://miro.medium.com/max/341/1*9BYYMKkNbWQtYyMe0fO2DA.png)
        ![](https://miro.medium.com/max/316/1*FnySNKmCFf-SljzRVmI5HQ.png)
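
The five-number summary behind the box plot described above can be computed directly. The following is a minimal sketch using only the Python standard library; the function names and the sample data are hypothetical, and the 1.5 × IQR rule for flagging outliers is assumed here as the common convention (it is also matplotlib's default whisker length), not something stated in these notes.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    # method="inclusive" treats the data as the whole population,
    # which keeps the quartiles inside the observed range.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Return the points lying more than k * IQR beyond the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]  # hypothetical data
summary = five_number_summary(sample)
outliers = iqr_outliers(sample)
print(summary)   # (1, 3.25, 5.5, 7.75, 50)
print(outliers)  # [50]
```

Different libraries interpolate quartiles slightly differently, so the quartile values drawn by a plotting library may not match these exactly.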

# Bivariate Analysis

1. Scatter Plots
2. Hex Plots
3. Stacked Bar Chart
4. Bivariate line chart

### Continuous & Continuous:

1. **Scatter Plot** : *For bivariate analysis between two continuous variables, we should look at a scatter plot. It is a nifty way to find out the relationship between two variables. The pattern of the scatter plot indicates the relationship between the variables, which can be linear or non-linear.*

    * `plt.scatter(x = df['col1'], y = df['col2'])`
    * `sns.scatterplot(x = 'col1', y = 'col2', data = df)`

    ![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Data_exploration_4.png)

Because they are prone to overplotting, scatter plots work best with relatively small datasets and with variables that have a large number of unique values. There are a few ways to deal with overplotting. One is to sample the points. Another, built right into pandas, is our next plot type: the hex plot.

2. **Hex Plot** : *A hex plot aggregates points in space into hexagons, and then colors those hexagons based on the values within them. The more data points fall inside a hexagon, the darker its color (with a light-to-dark colormap).*
    
    * `plt.hexbin(x = df['price'], y = df['points'], gridsize=15)`
    * `plt.hexbin(x, y, gridsize=50, cmap='inferno')`
    * `plt.hexbin(x, y, gridsize=50, bins='log', cmap='inferno')`
    
    * Parameters : 
        * gridsize = n => 'n' number of hexagons in the x-direction.
        * gridsize = (n,m) => 'n' number of hexagons in the x-direction, 'm' number of hexagons in the y-direction.
        * bins : 'log' or int or sequence, default: None
            * If None, no binning is applied; the color of each hexagon directly corresponds to its count value.
            * If 'log', use a logarithmic scale for the color map. Internally, log10(i+1) is used to determine the hexagon color. This is equivalent to norm=LogNorm().
            * If an integer, divide the counts in the specified number of bins, and color the hexagons accordingly.
            * If a sequence of values, the values of the lower bound of the bins to be used.
            
            ![](https://www.kaggleusercontent.com/kf/5832509/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..igGQ27UlRkUW2QlETpp7Ng.KxtS1EnDAMyKXaDwx_21XYt-YEIynEpcg_v1GQWK4KnzGEOxyPclKduYx512e2HS1Ots4_Nb3OCII2Evy57BsVALfSla6T5gATfZrjoyOONkT9K_e895k58SUs1inr7J1QnTifISc7jKniZtyVppFAIEQ3RKsguDPmseETM8ZOSvMw_bZsclFTSCNcZKhrQHumsyDP9POc8V3pmKo_N4Fkj3aQkuBeR5wHOTt9xNFyCePGoxMDVFc8KQwvpEu_UncZEntA9SxWy_FIO8GmJgJVCUeJBx5d4pQNxgKlmemiCkjDozWBzCHc7IR111vmbIdbAP8Pbm71RFdXV5HK6tVghxRbQYP974AT-p6B5ACsWnttUpDZ6FD5xJ5i92wjlmjb_u_CqqHux-OVCG7BpRymJGTI_-PaXynwNB_795RlSbJ8xn-yIS-X125uR84WCbikOMCeZpLtxKxBY4mMb-FDByOYjuc9OZo6JEjv9UiVYfaxD_jABeLN7vFYeHVuoen-8_dCTL1s7yYAcQ1R8F3Btex1h-4gH0GxZo2MY9G4iYJUwruoAAyew8EB2J7N2JFq2nQez0eJmIN3T47UINBCTUIYo80rErwMFZ9IUYKyiO6SaLIBAg24_OAE3SpVByGWhB2PhbQJp8cTSN2ddtHQJM9uS8g9c3b7Amybfjb3w.hjrjHhIgR4NrpSIWVCk1_g/__results___files/__results___8_1.png)
            ![](https://matplotlib.org/3.1.0/_images/sphx_glr_hexbin_demo_001.png)

### Correlation:

A scatter plot shows the relationship between two variables but does not indicate the strength of that relationship. To measure the strength, we use correlation, which varies between -1 and +1.

* -1: perfect negative linear correlation
* +1: perfect positive linear correlation
* 0: no linear correlation

Correlation can be derived using the following formula:

`Correlation = Covariance(X,Y) / SQRT(Var(X)* Var(Y))`
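
As a minimal sketch, the formula above can be implemented in plain Python as follows. The function name and the sample data are hypothetical and chosen only to reproduce the three reference values (-1, +1 and 0).

```python
import math

def pearson_correlation(x, y):
    """Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # The 1/n (or 1/(n-1)) factors cancel in the ratio, so plain sums suffice.
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov_xy / math.sqrt(var_x * var_y)

x = [1, 2, 3, 4, 5]                                # hypothetical data
r_pos = pearson_correlation(x, [2, 4, 6, 8, 10])   # perfectly increasing line
r_neg = pearson_correlation(x, [10, 8, 6, 4, 2])   # perfectly decreasing line
r_zero = pearson_correlation(x, [2, 1, 0, 1, 2])   # symmetric V shape, no linear trend
print(r_pos, r_neg, r_zero)
```

The last case illustrates that a correlation of 0 only rules out a linear relationship: the V-shaped data is clearly related to `x`, just not linearly. For data already in a pandas DataFrame, `df.corr()` computes the same Pearson coefficient for every pair of numeric columns.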

### Categorical & Categorical:

* **Two-way table:** We can start analyzing the relationship by creating a two-way table of count and count%. The rows represent the categories of one variable and the columns represent the categories of the other variable. Each cell shows the count or count% of observations for that combination of row and column categories.

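* **Stacked Column Chart:** This method is more of a visual form of the two-way table.

    * `plt.bar(x, y1)` followed by `plt.bar(x, y2, bottom=y1)` (`plt.bar` has no `stacked` parameter; bars are stacked through the `bottom` argument)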
    * `df.plot.bar(stacked=True)`
    * https://python-graph-gallery.com/12-stacked-barplot-with-matplotlib/

    ![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Data_exploration_6-850x152.gif)
    
    * *Limitations:*
        * The first limitation is that the second variable in a stacked plot must be a variable with a very limited number of possible values (probably an ordinal categorical, as here). Five different types of wine is a good number because it keeps the result interpretable; eight is sometimes mentioned as a suggested upper bound. Many dataset fields will not fit this criterion naturally, so you have to "make do", as here, by selecting a group of interest.

        * The second limitation is one of interpretability. As easy as they are to make, and as pretty as they look, stacked plots make it really hard to distinguish concrete values. For example, looking at the plots above, can you tell which wine got a score of 87 more often: Red Blends (in purple), Pinot Noir (in red), or Chardonnay (in green)? It's actually really hard to tell!
       
    **Line Plot**
    * `plt.plot(df)`, where `df` is a table whose rows and columns are the categories of the two features
    * `sns.lineplot(x="year", y="passengers", data=may_flights)`
    
    ![](https://www.kaggleusercontent.com/kf/5832509/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..igGQ27UlRkUW2QlETpp7Ng.KxtS1EnDAMyKXaDwx_21XYt-YEIynEpcg_v1GQWK4KnzGEOxyPclKduYx512e2HS1Ots4_Nb3OCII2Evy57BsVALfSla6T5gATfZrjoyOONkT9K_e895k58SUs1inr7J1QnTifISc7jKniZtyVppFAIEQ3RKsguDPmseETM8ZOSvMw_bZsclFTSCNcZKhrQHumsyDP9POc8V3pmKo_N4Fkj3aQkuBeR5wHOTt9xNFyCePGoxMDVFc8KQwvpEu_UncZEntA9SxWy_FIO8GmJgJVCUeJBx5d4pQNxgKlmemiCkjDozWBzCHc7IR111vmbIdbAP8Pbm71RFdXV5HK6tVghxRbQYP974AT-p6B5ACsWnttUpDZ6FD5xJ5i92wjlmjb_u_CqqHux-OVCG7BpRymJGTI_-PaXynwNB_795RlSbJ8xn-yIS-X125uR84WCbikOMCeZpLtxKxBY4mMb-FDByOYjuc9OZo6JEjv9UiVYfaxD_jABeLN7vFYeHVuoen-8_dCTL1s7yYAcQ1R8F3Btex1h-4gH0GxZo2MY9G4iYJUwruoAAyew8EB2J7N2JFq2nQez0eJmIN3T47UINBCTUIYo80rErwMFZ9IUYKyiO6SaLIBAg24_OAE3SpVByGWhB2PhbQJp8cTSN2ddtHQJM9uS8g9c3b7Amybfjb3w.hjrjHhIgR4NrpSIWVCk1_g/__results___files/__results___20_1.png)
    
    This overcomes the second limitation of stacked bar plots: interpretability.

* **Chi-Square Test:** *This test is used to assess the statistical significance of the relationship between the variables. It also tests whether the evidence in the sample is strong enough to generalize the relationship to a larger population. Chi-square is based on the difference between the expected and observed frequencies in one or more categories of the two-way table. It returns the probability for the computed chi-square statistic with the given degrees of freedom.*

Probability of 0: indicates that the two categorical variables are dependent.

Probability of 1: indicates that the sample shows no evidence against the two variables being independent.

Probability less than 0.05: indicates that the relationship between the variables is significant at the 95% confidence level.

The chi-square test statistic for a test of independence of two categorical variables is found by:
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Data_exploration_7.png)

where O is the observed frequency and E is the expected frequency under the null hypothesis, computed by:
![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/Data_exploration_8.png)

From the previous two-way table, the expected count for product category 1 to be of small size is 0.22. It is derived by taking the row total for Size (9) times the column total for Product category (2), then dividing by the sample size (81). This procedure is repeated for each cell.
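
The steps above (two-way table, expected counts, chi-square statistic) can be sketched in plain Python as follows. The observations are hypothetical and are not the table from the figure; only the single expected value 9 × 2 / 81 is taken from the text. The closed-form p-value used at the end is an assumption that holds only for a 2×2 table (one degree of freedom).

```python
import math
from collections import Counter

# Worked value from the text: row total 9, column total 2, sample size 81.
doc_expected = 9 * 2 / 81  # about 0.22

# Hypothetical observations: (product_category, size) pairs.
observations = (
    [("cat1", "small")] * 10 + [("cat1", "large")] * 20
    + [("cat2", "small")] * 30 + [("cat2", "large")] * 10
)

# Two-way table of counts, keyed by (row category, column category).
counts = Counter(observations)
rows = sorted({r for r, _ in counts})
cols = sorted({c for _, c in counts})
n = sum(counts.values())

row_totals = {r: sum(counts[(r, c)] for c in cols) for r in rows}
col_totals = {c: sum(counts[(r, c)] for r in rows) for c in cols}

# Expected count under independence: row total * column total / sample size.
expected = {(r, c): row_totals[r] * col_totals[c] / n for r in rows for c in cols}

# Chi-square statistic: sum over all cells of (O - E)^2 / E.
chi2 = sum(
    (counts[(r, c)] - expected[(r, c)]) ** 2 / expected[(r, c)]
    for r in rows for c in cols
)

dof = (len(rows) - 1) * (len(cols) - 1)
# Closed form for the upper-tail probability, valid only when dof == 1.
p_value = math.erfc(math.sqrt(chi2 / 2))

print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p_value:.5f}")
```

In practice a library routine such as `scipy.stats.chi2_contingency` would normally be used, since it handles any table size. Note that it applies a continuity correction to 2×2 tables by default, so its statistic can differ from the uncorrected value computed here.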

### Categorical & Continuous:

* **Box Plot:** While exploring the relationship between categorical and continuous variables, we can draw a box plot for each level of the categorical variable.

A box plot alone does not show statistical significance, especially when the number of levels is small. To assess statistical significance, we can perform a Z-test or a T-test.

* **Z-Test/ T-Test:** Either test assesses whether the means of two groups are statistically different from each other.

![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/ztestformula1.jpg)

If the probability of Z is small, the difference between the two averages is more significant. The T-test is very similar to the Z-test, but it is used when the number of observations in both categories is less than 30.

![](https://www.analyticsvidhya.com/wp-content/uploads/2015/02/ttest.png)
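
A minimal sketch of both statistics, assuming the formula images above show the usual two-sample forms: z as the difference of means divided by `sqrt(s1²/n1 + s2²/n2)`, and t as the difference of means divided by `sqrt(S² (1/n1 + 1/n2))` with pooled variance S². The function names and the data are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def two_sample_z(sample1, sample2):
    """z statistic and two-sided p-value for the difference of two means."""
    n1, n2 = len(sample1), len(sample2)
    standard_error = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    z = (mean(sample1) - mean(sample2)) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def pooled_t_statistic(sample1, sample2):
    """t statistic using the pooled variance, with its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    dof = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / dof
    t = (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, dof

group_a = [1, 2, 3, 4, 5]  # hypothetical measurements for category A
group_b = [3, 4, 5, 6, 7]  # hypothetical measurements for category B

z, p_value = two_sample_z(group_a, group_b)
t, dof = pooled_t_statistic(group_a, group_b)
print(f"z = {z:.3f}, two-sided p = {p_value:.4f}")
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

With only five observations per group, the t statistic is the appropriate one here; the z-based p-value is shown only to illustrate the calculation. Turning t into a p-value requires the t distribution, which the standard library does not provide; `scipy.stats.ttest_ind` is one commonly used routine for that.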