### About Dataset

* Observations published by Galton in 1886 on 934 children from 205 families.

* The main goal of the study was to establish the relationship between children's heights and their parents' heights. Galton also wanted to find out whether there was marital selection, that is, a relationship between a husband's height and his wife's height.

Feature Descriptions: 
* rownames: Index or row identifier for each entry.
* family: Family identifier or label.
* father: Height of the father, in inches.
* mother: Height of the mother, in inches.
* midparentHeight: Mid-parent height, calculated as (father + 1.08*mother)/2.
* children: Number of children in the family.
* childNum: Number of the child within the family, listed by height order (boys first, then girls).
* gender: Gender of the child (text format).
* childHeight: Height of the child, in inches.
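
The mid-parent formula in the feature list can be sketched as a small function. This is only an illustration: the function name `midparent_height` and the sample heights are hypothetical and are not rows from the dataset.

```python
def midparent_height(father: float, mother: float) -> float:
    """Mid-parent height as given in the feature list: (father + 1.08 * mother) / 2."""
    # The mother's height is scaled by 1.08 before averaging with the father's.
    return (father + 1.08 * mother) / 2


# Hypothetical parents, heights in inches (not taken from the dataset).
example = midparent_height(father=70.0, mother=65.0)
print(round(example, 2))
```

On the loaded DataFrame, the same expression, `(df['father'] + 1.08 * df['mother']) / 2`, should reproduce the `midparentHeight` column up to rounding.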

https://www.kaggle.com/datasets/jacopoferretti/parents-heights-vs-children-heights-galton-data

### Objective

Analyze height data from families, including parents and children, to understand how parental heights relate to the heights of their offspring.

### Snapshot

![Snapshot](picII.png)

### Library

In [10]:
import pandas as pd             
import numpy as np             
import matplotlib.pyplot as plt 
%matplotlib inline
import seaborn as sns       

In [11]:
df = pd.read_csv('GaltonFamilies.csv')

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 934 entries, 0 to 933
Data columns (total 9 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   rownames         934 non-null    int64  
 1   family           934 non-null    object 
 2   father           934 non-null    float64
 3   mother           934 non-null    float64
 4   midparentHeight  934 non-null    float64
 5   children         934 non-null    int64  
 6   childNum         934 non-null    int64  
 7   gender           934 non-null    object 
 8   childHeight      934 non-null    float64
dtypes: float64(4), int64(3), object(2)
memory usage: 65.8+ KB


In [13]:
df.shape

(934, 9)

In [14]:
df.describe()

| | rownames | father | mother | midparentHeight | children | childNum | childHeight |
|---|---|---|---|---|---|---|---|
| count | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 | 934.0 |
| mean | 467.5 | 69.197109 | 64.089293 | 69.206773 | 6.171306 | 3.585653 | 66.745931 |
| std | 269.766875 | 2.476479 | 2.290886 | 1.80237 | 2.729025 | 2.36141 | 3.579251 |
| min | 1.0 | 62.0 | 58.0 | 64.4 | 1.0 | 1.0 | 56.0 |
| 25% | 234.25 | 68.0 | 63.0 | 68.14 | 4.0 | 2.0 | 64.0 |
| 50% | 467.5 | 69.0 | 64.0 | 69.248 | 6.0 | 3.0 | 66.5 |
| 75% | 700.75 | 71.0 | 65.875 | 70.14 | 8.0 | 5.0 | 69.7 |
| max | 934.0 | 78.5 | 70.5 | 75.43 | 15.0 | 15.0 | 79.0 |


In [15]:
df.columns 

Index(['rownames', 'family', 'father', 'mother', 'midparentHeight', 'children',
       'childNum', 'gender', 'childHeight'],
      dtype='object')

In [16]:
print(df.isnull().sum())

rownames           0
family             0
father             0
mother             0
midparentHeight    0
children           0
childNum           0
gender             0
childHeight        0
dtype: int64


## EDA CYCLE

In [18]:
gender_counts = df['gender'].value_counts()

# Plot the pie chart, labelling each slice with its gender category
plt.figure(figsize=(4, 4))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Gender')
plt.show()

### Interpretation


The pie chart shows the distribution of genders in the dataset: males constitute 51.5% and females 48.5%, a slight majority of males.
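
The percentages can also be checked numerically rather than read off a chart. The sketch below uses only the standard library. The counts (481 and 453) are an assumption, back-calculated from the reported 51.5% / 48.5% split of 934 rows, and the labels `"male"` and `"female"` are assumed as well.

```python
from collections import Counter

# Assumed counts, back-calculated from the reported 51.5% / 48.5% split of 934 rows.
genders = ["male"] * 481 + ["female"] * 453

counts = Counter(genders)
total = sum(counts.values())

# Percentage share of each category, rounded to one decimal like the pie labels.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(shares)
```

With the DataFrame loaded, `df['gender'].value_counts(normalize=True)` gives the same proportions directly.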

In [None]:
sns.histplot(df['childHeight'], bins = 12, color = 'lightgreen', kde = True, edgecolor = 'green')

# Add labels
plt.title('Child Height')
plt.xlabel('Child Height')
plt.ylabel('Frequency')

# Show the plot
plt.show()

### Interpretation

* The histogram is slightly right-skewed.
* The most frequent child heights are around 65 inches.
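
A skewness statistic can back up a visual impression of skew. The following is a minimal sketch of the Fisher-Pearson coefficient (third central moment divided by the 1.5 power of the second), with no bias correction. The function name and both samples are hypothetical, and the values are not from the dataset.

```python
from statistics import fmean


def sample_skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5, without bias correction."""
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])  # second central moment
    m3 = fmean([(v - mean) ** 3 for v in values])  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical heights in inches (not rows from the dataset).
symmetric = [64.0, 65.0, 66.0, 67.0, 68.0]
right_tailed = [64.0, 65.0, 65.0, 66.0, 74.0]

print(sample_skewness(symmetric))     # zero for a symmetric sample
print(sample_skewness(right_tailed))  # positive when the long tail is on the right
```

A positive value indicates right skew and a negative value left skew. In pandas, `df['childHeight'].skew()` reports a bias-adjusted version of this statistic, so its values can differ slightly from this sketch.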

In [None]:
# Check the data types, since the scatterplot expects integer or float values
print(df['midparentHeight'].dtype)
print(df['childHeight'].dtype)


In [None]:
sns.scatterplot(x=df['midparentHeight'], y=df['childHeight'])  
plt.title('Scatterplot of Mid Parent Height and Child Height')    
plt.xlabel('Mid Parent Height')                                
plt.ylabel('Child Height')                          
plt.show()

In [None]:
sns.scatterplot(x=df['midparentHeight'], y=df['childHeight'], hue=df['gender'])  
plt.title('Scatterplot of Mid Parent Height and Child Height (With Gender)')    
plt.xlabel('Mid Parent Height')                                
plt.ylabel('Child Height')                          
plt.show()

### Interpretation

The scatterplot shows a noticeable positive association between mid-parent height and child height: as mid-parent height increases, child height tends to increase as well.
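
One way to quantify the upward trend in a scatterplot is to fit an ordinary least-squares line and look at its slope. The sketch below is a plain-Python illustration, not the notebook's own method. The function name and the data points are hypothetical and are not rows from the dataset.

```python
from statistics import fmean


def least_squares_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Hypothetical heights in inches (not rows from the dataset).
midparent = [66.0, 68.0, 69.0, 70.0, 72.0]
child = [65.0, 66.5, 67.0, 68.0, 69.5]

slope, intercept = least_squares_line(midparent, child)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A positive slope means that child height rises with mid-parent height on average. Applied to the real columns, the slope gives the expected change in child height per additional inch of mid-parent height.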

In [None]:
plt.figure(figsize=(10, 7))  # Adjust the figure size
sns.boxplot(x=df['gender'], y=df['childHeight'])

plt.xlabel('Gender')
plt.ylabel('Child Height')
plt.title('Height by Gender')
plt.show()

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(data=df, x='gender', y='childHeight')
plt.title('Bar Plot of Average Child Height by Gender')
plt.xlabel('Gender')
plt.ylabel('Average Child Height')
plt.show()

### Interpretation

* Across the gender-colored scatter plot, box plot, and bar plot above, male children are taller on average than female children.
* In this sample, the average height is around 69 inches for male children and around 64 inches for female children.
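
The per-gender averages that the bar plot displays can also be computed directly. The sketch below groups heights by gender in plain Python. The rows and the `"male"` / `"female"` labels are hypothetical, chosen only to resemble the figures quoted above.

```python
from collections import defaultdict
from statistics import fmean

# Hypothetical (gender, childHeight) pairs in inches, not rows from the dataset.
rows = [
    ("male", 70.0), ("male", 68.5), ("male", 69.0),
    ("female", 64.0), ("female", 63.5), ("female", 65.0),
]

# Collect the heights for each gender, then average each group.
heights_by_gender = defaultdict(list)
for gender, height in rows:
    heights_by_gender[gender].append(height)

mean_by_gender = {g: fmean(h) for g, h in heights_by_gender.items()}
gap = mean_by_gender["male"] - mean_by_gender["female"]
print(mean_by_gender, gap)
```

On the DataFrame, `df.groupby('gender')['childHeight'].mean()` performs the same grouping and averaging in one call.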

In [None]:
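# Restrict to numeric columns; 'family' and 'gender' are text and cannot be correlated
corr_matrix = df.corr(numeric_only=True)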
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heat Map')
plt.show()

### Interpretation

* The heatmap shows positive correlations between the parental height columns and child height: taller parents generally have taller children. The heatmap pools all children, so it does not by itself show how the relationship differs by gender.

* The correlation coefficient of 0.32 between mid-parent height and child height indicates a moderate positive relationship. Taller parents tend to have taller children, but mid-parent height alone leaves most of the variation in child height unexplained. Other factors, such as the child's gender, genetic variation, and environmental influences, also contribute to the variability in child height.
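
To make the size of this correlation concrete, the sketch below defines the Pearson coefficient in plain Python and squares the reported value of 0.32, which gives the share of variance a straight-line fit would account for. The function name and the sanity-check data are hypothetical. Only the value 0.32 comes from the analysis above.

```python
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x_mean, y_mean = fmean(x), fmean(y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# Sanity check on hypothetical, perfectly linear data.
r_check = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Coefficient reported in the interpretation above.
r_reported = 0.32
r_squared = r_reported ** 2  # share of variance accounted for by a linear fit
print(round(r_squared, 4))
```

A squared correlation of about 0.10 means that roughly 10% of the variation in child height is accounted for by a linear relationship with mid-parent height.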

## Conclusion

From this analysis of Galton's dataset:

* There is a moderate positive correlation (coefficient of 0.32) between mid-parent height and child height: children of taller parents tend to be taller.
* This relationship appears consistent across genders, with both male and female children showing a similar upward trend with parental height.
* The scatter plots and the heatmap both support this trend.
* Galton's dataset points to the role of parental height, and by extension heredity, in offspring height, a finding with enduring implications in genetics and family health studies.
* Further exploration into factors beyond genetics, such as environmental influences and nutrition, could provide deeper insights into variations in child height.