# Analysis of Baseball Performances


### Introduction 

In this project we study and build an explanatory data visualization of a dataset of 1,157 baseball players. For each player, the dataset records handedness (right, left, or both), height (in inches), weight (in pounds), home runs, and batting average. We use Tableau to create visualizations that may lead to new insights about the performance of these players.


During the analysis, Tableau Public was used to visually explore the relationships between variables such as batting average, handedness, home runs, height, and weight.

Two new variables were also created for this analysis: the height/weight ratio (height in inches divided by weight in pounds) and the body mass index (BMI), calculated as BMI = (weight in pounds / (height in inches × height in inches)) × 703.
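
The two derived variables can also be computed outside Tableau. The snippet below is a minimal pandas sketch of the formulas, not the calculated fields used in the project. The column name `weight`, the function name `add_body_metrics`, and the sample rows are assumptions made for illustration.

```python
import pandas as pd


def add_body_metrics(players: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with height/weight ratio and BMI columns added."""
    out = players.copy()
    # Height/weight ratio: inches per pound.
    out["height_weight_ratio"] = out["height"] / out["weight"]
    # BMI for imperial units: (pounds / inches^2) * 703.
    out["bmi"] = out["weight"] / out["height"] ** 2 * 703
    return out


# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({"height": [72, 70], "weight": [180, 175]})
metrics = add_body_metrics(sample)
print(metrics.round(3))
```

Both sample players have a height/weight ratio of 0.4, the ratio discussed later in this analysis, while their BMI values differ because BMI uses the square of the height.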


The visualizations are laid out as a Tableau story. The chart types used are scatterplots, bar graphs, box-and-whisker plots, and area graphs, and legends were added to the box-and-whisker plots.


For visual interaction, tooltips and transitions were used.



### Experimenting with Different Visualization Techniques

Initially, histograms were created to analyze batting average versus height/weight ratio and BMI versus batting average. However, both variables in each pair are quantitative, so after feedback from my mentor Dominic I switched to scatterplots, which show the relationship between two quantitative variables directly. The scatterplots showed that a height/weight ratio of 0.4 had the best average batting average, 0.296, and that players with a BMI of approximately 23 had the best batting average, approximately 0.300.



I also wanted to see whether area graphs could be used to study the relationships between some of the variables. My first attempt plotted height against total home runs. Because that graph summed the home runs of every player at each integer height, it did not represent the data well: a height shared by many players gets a large total simply because more players contribute to it. For example, there appeared to be a great difference between players who are 69 and 72 inches tall, but only because the home runs were summed. I therefore changed the graph to show the average home runs at each height. I also filtered out the players with 0 home runs in the original dataset, because including them made the graph very jagged and noisy; removing them produced a smoother area curve. In the modified graph of height versus average home runs, players who are 67 inches tall had the highest average.
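
The difference between summing and averaging can be seen on a toy example. The snippet below is a small sketch with made-up rows; only the column names `Homeruns` and `height` come from the cleaned file used in this project.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the cleaned file.
players = pd.DataFrame({
    "Homeruns": [0, 10, 20, 30, 0, 40],
    "height":   [69, 69, 69, 69, 72, 72],
})

# Drop players with zero home runs before aggregating.
hitters = players[players["Homeruns"] > 0]

# Compare the number of players, the total, and the mean at each height.
per_height = hitters.groupby("height")["Homeruns"].agg(["count", "sum", "mean"])
print(per_height)
```

In this toy data the total is larger at 69 inches only because three players contribute to it, while the mean is higher at 72 inches. That is the effect that made the summed area graph misleading.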



Below is the code I used to clean the data:

In [1]:
# import the libraries

import pandas as pd
import numpy as np

# load the data set
data = pd.read_csv("Cleaned data.csv")

In [2]:
# display the data
data

Unnamed: 0,Homeruns,Avg,height
0,7,0.219,72
1,90,0.245,74
2,5,0.259,74
3,5,0.251,71
4,0,0.000,70
5,0,0.000,73
6,21,0.257,70
7,59,0.234,73
8,20,0.261,71
9,17,0.230,74


In [3]:
## In this cell, we define a function that smooths our data set.
## We are analyzing Home Runs vs Height as an area graph.
## The original data is too spiky to plot as an area graph, so we combine the home run values that share the same height into their mean, which reduces the number of spikes and smooths the curve.

def smoothing(pandasDataFrame):
    # select the three columns by name and convert the pandas data frame
    # to a numpy array (this also skips a saved index column if present)
    M = pandasDataFrame[["Homeruns", "Avg", "height"]].values

    # extract the unique heights and sort them from smallest to largest
    heights = sorted(set(M[:, 2]))

    # list that stores the mean home runs for each height
    total = []

    # loop over every unique height value
    for height in heights:

        # list of home runs for a single height value
        homeruns = []

        # loop over the rows of the data set
        for row in M:

            # unpack the components of the current row
            hr, ave, ht = row

            # keep the home run value when the row's height matches
            # the height from the outer loop
            if ht == height:
                homeruns.append(hr)

        # average the home runs for this height and store the mean
        total.append(np.mean(homeruns))

    # return a two-column array: height, mean home runs
    return np.transpose([heights, total])

In [4]:
y=smoothing(data)
print(y)

[[  65.           41.        ]
 [  66.           18.        ]
 [  67.          100.        ]
 [  68.           14.18181818]
 [  69.           26.46341463]
 [  70.           39.49107143]
 [  71.           47.2137931 ]
 [  72.           42.30172414]
 [  73.           50.9742268 ]
 [  74.           61.13888889]
 [  75.           49.57024793]
 [  76.           34.        ]
 [  77.            1.7       ]
 [  78.           40.18181818]
 [  79.           13.66666667]
 [  80.            2.5       ]]


In [5]:
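# save the smoothed table as text; note that fmt='%d' writes whole numbers,
# so the decimal part of each mean is dropped in the file
np.savetxt(r'newdata.txt', np.array(y), fmt='%d')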
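
The `fmt` argument controls how each number is written. The sketch below uses two rows copied from the printed output above and writes them to in-memory buffers instead of `newdata.txt`, so it can be run safely. The `'%.2f'` variant is only an illustrative alternative, not what the project used.

```python
import io

import numpy as np

# Two rows copied from the smoothing output: [height, mean home runs].
rows = np.array([[68.0, 14.18181818], [69.0, 26.46341463]])

as_int = io.StringIO()
np.savetxt(as_int, rows, fmt="%d")      # integer format drops the decimals

as_float = io.StringIO()
np.savetxt(as_float, rows, fmt="%.2f")  # keeps two decimal places

print(as_int.getvalue())
print(as_float.getvalue())
```

With `'%d'` the mean for 68 inches is stored as 14 rather than 14.18, which is worth keeping in mind when the saved file is loaded back into Tableau.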

When analyzing handedness, it appeared that handedness does not significantly affect batting average. However, when comparing home runs across handedness, left-handed batters had the highest home run totals, with a median of 23.5. The median for right-handed batters was 14, and the median for batters who hit from both sides (switch hitters) was 13.
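
The medians above were read from Tableau's box-and-whisker plots. As a rough sketch of the same summary in pandas, the snippet below computes the quartiles that such a plot displays for each group. The column name `handedness`, its codes (`L`, `R`, `B`), and all values are hypothetical.

```python
import pandas as pd

# Hypothetical rows; "handedness" and its codes (L, R, B) are assumed names.
players = pd.DataFrame({
    "handedness": ["L", "L", "L", "R", "R", "R", "B", "B", "B"],
    "Homeruns":   [10, 20, 60, 5, 15, 40, 2, 12, 30],
})

# Quartiles per group: the box edges and the median line of a box plot.
summary = (
    players.groupby("handedness")["Homeruns"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
summary.columns = ["q1", "median", "q3"]
print(summary)
```

The median is used here rather than the mean because home run totals are skewed by a few very high values, and the median is less sensitive to them.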


When plotting height against the moving average of home runs, players who are 72 inches tall appeared to have the best home run results.
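
A moving average replaces each value with the mean of a small window of neighboring values. The snippet below is a minimal sketch of a trailing moving average in pandas. The values are hypothetical and the window of 3 is an assumption, since the window used in the Tableau chart is not stated here, so it does not reproduce the result above.

```python
import pandas as pd

# Hypothetical mean home runs per height (inches); not the project's values.
mean_hr = pd.Series(
    [20.0, 30.0, 10.0, 50.0, 40.0],
    index=pd.Index([69, 70, 71, 72, 73], name="height"),
)

# Trailing moving average: the current height plus the two before it.
# The window of 3 is an assumption; the Tableau setting is not stated.
moving_avg = mean_hr.rolling(window=3, min_periods=1).mean()

print(moving_avg)
print("Peak height:", moving_avg.idxmax())
```

Because each point borrows from its neighbors, the peak of the moving average can fall at a different height than the peak of the raw per-height averages, which is why the two charts can point to different heights.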



Below is the first draft:

https://public.tableau.com/profile/grace4573#!/vizhome/BaseballProject_2/Story1

Below is the final version of the graphs I selected:

https://public.tableau.com/profile/grace4573#!/vizhome/BaseballProjectFinalVersion/Story1


### Conclusion

In conclusion, handedness does not seem to significantly affect batting average, although left-handed batters had the highest median home runs (23.5, compared with 14 for right-handed batters and 13 for batters who hit from both sides).

When looking at the average home runs at each height, players who are 67 inches tall had the highest average.

A height/weight ratio of 0.4 had the best average batting average (0.296), and players with a BMI of approximately 23 had the best batting average (approximately 0.300).

When looking at height versus the moving average of home runs, players who are 72 inches tall performed best.

