# Measles Vaccination Rates Project

This dataset contains overall and measles, mumps, and rubella (MMR) immunization rates for schools across the United States. Each row corresponds to one school and includes variables such as the school's name, latitude, longitude, and vaccination rates.

The dataset contains the following columns:

index: An identifier for each row.
state: The state where the school is located.
year: The academic year for which the data was collected.
name: The name of the school.
type: The type of the school (e.g., public, private, charter).
city: The city where the school is located.
county: The county where the school is located.
district: The district where the school is located.
enroll: The number of students enrolled in the school.
mmr: The MMR vaccination rate at the school.
overall: The overall vaccination rate at the school.
xrel, xmed, xper: These columns appear to have many missing values. Without more information about what they represent, it is unclear how they could be used in the analysis.
lat, lng: The latitude and longitude of the school.
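
The remark about missing values in xrel, xmed, and xper can be checked directly rather than judged by eye. The following is a minimal sketch, not part of the original notebook: `missing_share` is a hypothetical helper name, and the `toy` frame holds made-up rows rather than values from measles.csv.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing (NaN) values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Made-up toy rows for illustration only; these are NOT taken from measles.csv.
toy = pd.DataFrame({
    "mmr": [100.0, 95.0, 98.0, 90.0],
    "xrel": [np.nan, np.nan, np.nan, np.nan],
    "xmed": [np.nan, 2.33, np.nan, np.nan],
})

print(missing_share(toy))
```

Once the real file is loaded, calling `missing_share(data)` should give the same kind of per-column summary. Note that this only counts NaN values; a sentinel such as -1 would not be counted as missing.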

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('measles.csv')

In [3]:
data.head()

Unnamed: 0,index,state,year,name,type,city,county,district,enroll,mmr,overall,xrel,xmed,xper,lat,lng
0,1,Arizona,2018-19,A J Mitchell Elementary,Public,Nogales,Santa Cruz,,51.0,100.0,-1.0,,,,31.347819,-110.938031
1,2,Arizona,2018-19,Academy Del Sol,Charter,Tucson,Pima,,22.0,100.0,-1.0,,,,32.221922,-110.896103
2,3,Arizona,2018-19,Academy Del Sol - Hope,Charter,Tucson,Pima,,85.0,100.0,-1.0,,,,32.130493,-111.117005
3,4,Arizona,2018-19,Academy Of Mathematics And Science South,Charter,Phoenix,Maricopa,,60.0,100.0,-1.0,,,,33.485447,-112.130633
4,5,Arizona,2018-19,Acclaim Academy,Charter,Phoenix,Maricopa,,43.0,100.0,-1.0,,2.33,2.33,33.49562,-112.224722


In [4]:
data['overall'].unique()

array([-1., 96., 99., ..., 10., 47.,  8.])

In [6]:
# Count how many times each unique value of the 'overall' column appears
data['overall'].value_counts()

-1.00      20177
 100.00     3172
 98.00      2112
 95.00      1798
 99.00       806
           ...  
 74.49         1
 70.42         1
 77.54         1
 70.69         1
 8.00          1
Name: overall, Length: 2691, dtype: int64

In some rows, the 'overall' vaccination rate is set to -1. These are the rows for which the overall vaccination rate is not yet available and which we are trying to predict.

Next, we count how many rows have an 'overall' vaccination rate of -1. This tells us how much data is available for training the model (rows with a known overall vaccination rate) and how many rows the model will need to predict (rows with an 'overall' vaccination rate of -1).

In [9]:
# Count the number of rows with 'overall' vaccination rate of -1
missing_overall = data[data['overall'] == -1].shape[0]

# Count the total number of rows
total_rows = data.shape[0]

missing_overall, total_rows, missing_overall / total_rows * 100

(20177, 46411, 43.474607312921506)
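
The split into rows with a known 'overall' rate and rows marked -1 could be sketched as follows. This is an illustrative sketch under assumptions, not the notebook's own code: `split_by_overall` and `SENTINEL` are hypothetical names, and the `toy` frame contains made-up rows rather than values from measles.csv.

```python
import pandas as pd

SENTINEL = -1  # value the text describes as marking an unavailable 'overall' rate


def split_by_overall(df: pd.DataFrame):
    """Split rows into (known, unknown) using the 'overall' sentinel value."""
    unknown_mask = df["overall"] == SENTINEL
    # Rows where 'overall' is NaN compare as False and therefore land in `known`;
    # they would need separate handling if the real data contains any.
    return df[~unknown_mask], df[unknown_mask]


# Made-up toy rows for illustration only; these are NOT taken from measles.csv.
toy = pd.DataFrame({
    "name": ["School A", "School B", "School C", "School D"],
    "mmr": [100.0, 97.0, 92.0, 99.0],
    "overall": [-1.0, 96.0, -1.0, 99.0],
})

known, unknown = split_by_overall(toy)
pct_unknown = len(unknown) / len(toy) * 100
print(len(known), len(unknown), pct_unknown)
```

Applied to the loaded frame, `split_by_overall(data)` should return an `unknown` part whose length matches the count of -1 rows computed above.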