<a href="https://colab.research.google.com/github/HSE-LAMBDA/MLDM-2021/blob/master/03-linear-classification-and-regularization-pt1/MLDM_2021_seminar03_EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#!pip install folium

In [None]:
#!pip install geopandas

In [None]:
#!wget https://raw.githubusercontent.com/HSE-LAMBDA/MLDM-2021/main/03-linear-classification-and-regularization-pt1/EDA_dataset.zip

In [None]:
#!unzip EDA_dataset

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import branca.colormap as cm_b
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from scipy.stats import norm, normaltest, lognorm
from sklearn.linear_model import LinearRegression
from scipy.interpolate import interpn
import itertools
import folium
import glob

The main focus of this notebook is Exploratory Data Analysis (EDA) involving geodata. All the source datasets are taken from RUElectionData (https://t.me/RUElectionData). These sources may not be precise, because the documents are not official. We therefore use them only as example data for EDA and plotting, and we draw no conclusions from them.

Let's load the data containing the coordinates of the precinct election commissions (УИК).

In [None]:
uiks = pd.read_json('./uiklinks_2021_09_19_ЦИК_России_Выборы_депутатов_Государственной', lines=True)

In [None]:
uiks.head()

We can define a new column with the `apply` method, which invokes a function on each value of a Series (here, `uiks.name`).

In [None]:
uiks['fixed_name'] = uiks.name.apply(lambda x: str(x).replace('Участковая избирательная комиссия','УИК'))

You can define functions before using `apply` as well.

In [None]:
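def get_lat(x):
    # Return the latitude string without spaces, or NaN if the record is missing or malformed
    try:
        return x['lat'].replace(' ','')
    except Exception:
        return np.nan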

In [None]:
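def get_lon(x):
    # Return the longitude string without spaces, or NaN if the record is missing or malformed
    try:
        return x['lon'].replace(' ','')
    except Exception:
        return np.nan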

In [None]:
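def is_Moscow(x):
    # Return 1 if the address text mentions Moscow, 0 if not, NaN if the record is missing or malformed
    try:
        return 1 if 'город Москва' in x['address'] else 0
    except Exception:
        return np.nan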

In [None]:
uiks['lat'] = uiks.address.apply(get_lat).astype(float)

In [None]:
uiks['lon'] = uiks.address.apply(get_lon).astype(float)

In [None]:
uiks['isMoscow'] = uiks.address.apply(is_Moscow)

Is it possible to achieve the same results by calling `apply` only once?

In [None]:
#<YOUR CODE HERE>
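
One possible answer, sketched on a toy frame rather than on the real `uiks` data. It assumes each `address` value is a dict with `lat`, `lon` and `address` keys, as the helper functions above imply. The names `toy` and `parse_address` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the `address` column (assumed layout: dicts with 'lat', 'lon', 'address').
toy = pd.DataFrame({'address': [
    {'lat': '55.67 ', 'lon': ' 37.56', 'address': 'город Москва, улица Примерная, 1'},
    {'lat': '59.93', 'lon': '30.33', 'address': 'город Санкт-Петербург'},
    None,  # a record without address data
]})

def parse_address(x):
    """Return lat, lon and the Moscow flag for one address record (hypothetical helper)."""
    try:
        return pd.Series({
            'lat': float(x['lat'].replace(' ', '')),
            'lon': float(x['lon'].replace(' ', '')),
            'isMoscow': 1 if 'город Москва' in x['address'] else 0,
        })
    except (TypeError, KeyError, AttributeError, ValueError):
        return pd.Series({'lat': np.nan, 'lon': np.nan, 'isMoscow': np.nan})

# A function that returns a Series makes `apply` produce a DataFrame: one column per key.
parsed = toy['address'].apply(parse_address)
toy = toy.join(parsed)
print(toy[['lat', 'lon', 'isMoscow']])
```

Unlike the three separate helpers, this version sets all three fields to NaN when any one of them fails to parse. Whether that is acceptable depends on how the missing values are used later.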

Now, let's have a look at Cheryomushki District data (you may choose any other district as well).

In [None]:
cher_data = pd.read_csv('./Moscow/Город Москва – Черемушкинский.tsv', sep = '\t')

In [None]:
cher_data.head()

Using `uiks` dataframe we can map commissions according to their coordinates via `folium`.

In [None]:
# Create a Map object; `location` sets the initial map center and `zoom_start` the initial zoom level
m_1 = folium.Map(
    location=[55.674093, 37.620407],
    zoom_start=11
)

In [None]:
uiks_list = cher_data.uik.unique()

In [None]:
# Now we can add markers using coordinates
mask = (uiks['isMoscow'] == 1)
for place in uiks_list:
    try:
        folium.Marker(location=[uiks[(uiks.fixed_name == place) & mask]['lat'].values[0],
                                uiks[(uiks.fixed_name == place) & mask]['lon'].values[0]],
                      popup=str(place)).add_to(m_1)
    except Exception:
        # Look at the commissions without coordinates. Do you have any ideas about them? What do they have in common?
        print(place)
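
The loop above filters the whole `uiks` frame twice for every commission. A possible alternative is to build the coordinate lookup once and then check membership. The sketch below uses made-up toy data, and the names `toy_uiks`, `toy_list`, `coords`, `found` and `missing` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for `uiks` and `uiks_list` (values are made up for illustration).
toy_uiks = pd.DataFrame({
    'fixed_name': ['УИК №1', 'УИК №2', 'УИК №1'],
    'isMoscow':   [1, 1, 0],
    'lat':        [55.67, 55.68, 43.10],
    'lon':        [37.56, 37.57, 131.90],
})
toy_list = ['УИК №1', 'УИК №2', 'УИК №3']

# Keep Moscow rows only, then index the coordinates by commission name once.
moscow = toy_uiks[toy_uiks['isMoscow'] == 1].drop_duplicates('fixed_name')
coords = moscow.set_index('fixed_name')[['lat', 'lon']]

found = {}
missing = []
for place in toy_list:
    if place in coords.index:
        # [lat, lon], the shape expected by the `location` argument of a marker
        found[place] = coords.loc[place].tolist()
    else:
        missing.append(place)

print(found)
print(missing)
```

The explicit `missing` list replaces the `try`/`except` block, so unrelated errors are no longer silently treated as "no coordinates".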

In [None]:
m_1  # Executing this cell displays the interactive map

It's time to have a look at the results.

In [None]:
candidates = cher_data.columns[17:-1] # These columns correspond to candidates' names

In [None]:
votes = cher_data[candidates].sum()

In [None]:
# This function builds the wedge labels: percentage plus absolute number of votes
def make_autopct(values):
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct*total/100.0))
        return '{p:.2f}%  ({v:d})'.format(p=pct,v=val)
    return my_autopct

plt.figure(figsize=(18,18))

plt.pie(votes.sort_values(),
        labels=votes.sort_values().index,
        autopct=make_autopct(votes.sort_values()),
        wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'},
        pctdistance=0.8,
        radius=0.8,
        startangle=0,
        textprops=dict(color='k', fontsize=12));
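
To see what `make_autopct` produces, the closure can be called directly, without drawing a pie. Matplotlib passes each wedge's percentage to the `autopct` callable, and the closure converts it back to an absolute count. The vote counts in `toy_votes` are made up.

```python
def make_autopct(values):
    # Same helper as in the cell above, repeated so the sketch is self-contained.
    def my_autopct(pct):
        total = sum(values)
        val = int(round(pct * total / 100.0))
        return '{p:.2f}%  ({v:d})'.format(p=pct, v=val)
    return my_autopct

toy_votes = [50, 30, 20]          # hypothetical vote counts, total of 100
label = make_autopct(toy_votes)   # closure that remembers the totals
print(label(30.0))                # label for a wedge covering 30% of the pie
```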

What about turnout?

In [None]:
cher_data['turnout'] = (cher_data['Число действительных избирательных бюллетеней'] + 
                    cher_data['Число недействительных избирательных бюллетеней']) / cher_data['Число избирателей, внесенных в список избирателей на момент окончания голосования']  
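
As a small worked example of the same formula, turnout = (valid + invalid ballots) / registered voters. The toy frame uses short English column names (`valid`, `invalid`, `registered`) that are assumptions for illustration, not the dataset's real headers.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'valid':      [380, 95, 0],
    'invalid':    [20, 5, 0],
    'registered': [1000, 100, 0],
})

# Replace zero voter lists with NaN so the division yields NaN instead of a misleading value.
registered = toy['registered'].replace(0, np.nan)
toy['turnout'] = (toy['valid'] + toy['invalid']) / registered
print(toy)
```

The third row shows why `fillna` appears in later cells: a commission with an empty voter list has no defined turnout.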

We can plot commissions with high turnout values.

In [None]:
m_2 = folium.Map(
    location=[55.674093, 37.620407],
    zoom_start=11
)

In [None]:
mask = (uiks['isMoscow'] == 1)
for place in uiks_list:
    temp_t = cher_data[(cher_data.uik == place)]['turnout'].values[0]
    if temp_t > 0.4:
        try:
            folium.Marker(location=[uiks[(uiks.fixed_name == place) & mask]['lat'].values[0],
                                                  uiks[(uiks.fixed_name == place) & mask]['lon'].values[0]],
                          popup=place + ' ' + str(temp_t)[:4]).add_to(m_2)
        except:
            print(place)

In [None]:
m_2

In [None]:
cher_data[cher_data.uik == 'УИК №2366'].T

Do you have any insights? Note the number of votes.

In [None]:
cher_data[cher_data.turnout > 0.8]

Let's plot the pie chart once again, this time dropping the commissions with high turnout values.

In [None]:
votes_wo_5014 = cher_data[(~cher_data.uik.isin(cher_data[cher_data.turnout > 0.8].uik.values))][candidates].sum()

In [None]:
plt.figure(figsize=(18,18))
plt.pie(votes_wo_5014.sort_values(),
        labels=votes_wo_5014.sort_values().index,
        autopct=make_autopct(votes_wo_5014.sort_values()),
        wedgeprops={'linewidth': 3.0, 'edgecolor': 'white'},
        pctdistance=0.8,
        radius=0.8,
        startangle=0,
        textprops=dict(color='k', fontsize=12));

Is there any relationship between turnout and a candidate's share of the votes? What can you say about the weights of the LinearRegression models?

In [None]:
k = 0

fig, ax = plt.subplots(5,2,figsize=(20,25))
cand_lin_reg = LinearRegression()

for cand in candidates:
    # A possible way to define the position of the current plot in the subplot grid
    i = k // 2
    j = k % 2
    
    x = cher_data.turnout.fillna(0)
    y = (cher_data[cand] / cher_data['Число действительных избирательных бюллетеней']).fillna(0)
    
    # This part makes 2d-histogram just a bit prettier
    data , x_e, y_e = np.histogram2d( x, y, bins = (100,100), density = True )
    z = interpn( ( 0.5*(x_e[1:] + x_e[:-1]), 0.5*(y_e[1:]+y_e[:-1]) ),
                data,
                np.vstack([x,y]).T,
                method = "splinef2d",
                bounds_error = False)
    z[np.where(np.isnan(z))] = 0.0
    idx = z.argsort()
    x, y, z = x[idx], y[idx], z[idx]
    
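    # Here we plot all the pairs of turnouts and fractions of votes
    # using colors according to density
    ax[i,j].scatter( x, y, c=z, )
    # Weight each commission by the candidate's vote count; the weights are
    # reordered with idx so that they stay aligned with the sorted x and y
    cand_lin_reg.fit(x.values.reshape(-1, 1), y, sample_weight=cher_data[cand].iloc[idx])
    ax[i,j].plot(x.sort_values(), cand_lin_reg.predict(x.sort_values().values.reshape(-1, 1)), c='purple')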
    ax[i,j].set_title(cand)
    ax[i,j].set_xlabel('Voter turnout')
    ax[i,j].set_ylabel('Votes Share')
    plt.subplots_adjust(hspace = .3)
    
    k += 1
plt.show();
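
The third argument of `fit` in the cell above is a per-sample weight. A minimal sketch of what such weights do, written in plain NumPy rather than with scikit-learn: weighted least squares is ordinary least squares after scaling each row by the square root of its weight. The data and the helper `weighted_line` are hypothetical.

```python
import numpy as np

# Three points lie on y = 0.5 * x + 0.2; the fourth is an outlier with a small weight.
x = np.array([0.2, 0.3, 0.4, 0.9])
y = np.array([0.30, 0.35, 0.40, 0.95])
w = np.array([100.0, 100.0, 100.0, 1.0])

def weighted_line(x, y, w):
    """Fit y ~ slope * x + intercept, minimising sum(w * residual**2)."""
    A = np.column_stack([x, np.ones_like(x)])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef  # array([slope, intercept])

slope_w, intercept_w = weighted_line(x, y, w)
slope_u, intercept_u = weighted_line(x, y, np.ones_like(x))
print(slope_w, slope_u)
```

The weighted slope stays much closer to 0.5 than the unweighted one, because the outlier contributes little to the loss. Using vote counts as weights likewise makes large commissions dominate the fitted line.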

Now let's proceed to full Moscow data.

In [None]:
# define the path to the directory containing all the datasets
path = './Moscow/'
all_files = glob.glob(path + "*.tsv")

dfs = []

# load the datasets
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0, sep = '\t')
    dfs.append(df)

# merge all the datasets
full_data = pd.concat(dfs, axis=0, ignore_index=True)
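
The load-and-merge pattern can be tried on throwaway files. The sketch below writes two small TSV files to a temporary directory and concatenates them, also recording which file each row came from. The `source_file` column and all file names are hypothetical additions.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two made-up district files
    pd.DataFrame({'uik': ['УИК №1'], 'votes': [10]}).to_csv(
        os.path.join(tmp_dir, 'a.tsv'), sep='\t', index=False)
    pd.DataFrame({'uik': ['УИК №2', 'УИК №3'], 'votes': [20, 30]}).to_csv(
        os.path.join(tmp_dir, 'b.tsv'), sep='\t', index=False)

    frames = []
    # sorted() makes the row order reproducible; glob itself gives no ordering guarantee
    for filename in sorted(glob.glob(os.path.join(tmp_dir, '*.tsv'))):
        frame = pd.read_csv(filename, sep='\t')
        frame['source_file'] = os.path.basename(filename)
        frames.append(frame)

# ignore_index=True gives the merged frame a fresh 0..n-1 index
merged = pd.concat(frames, axis=0, ignore_index=True)
print(merged)
```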

In [None]:
full_data.head()

In [None]:
full_data['turnout'] = (full_data['Число действительных избирательных бюллетеней'] + 
                        full_data['Число недействительных избирательных бюллетеней']) / full_data['Число избирателей, внесенных в список избирателей на момент окончания голосования']  

Once again, let's plot commissions and use colors according to turnout.

In [None]:
m_3 = folium.Map(
    location=[55.674093, 37.620407],
    zoom_start=9
)

In [None]:
full_uiks_list = full_data.uik.unique()

In [None]:
colormap = cm_b.LinearColormap(colors=['orange','red',],vmin=0.5,vmax=1.)

In [None]:
mask = (uiks['isMoscow'] == 1)
for place in full_uiks_list:
    cur_app = full_data[(full_data.uik == place)]['turnout'].values[0]
    if cur_app > 0.5:
        try:
            radius = 10
            folium.CircleMarker(radius=radius,
                          location=[uiks[(uiks.fixed_name == place) & mask]['lat'].values[0],
                                    uiks[(uiks.fixed_name == place) & mask]['lon'].values[0]],
                          popup=place + ' ' + str(cur_app)[:4],
                          fill=True, color=colormap(cur_app)).add_to(m_3)
        except:
            print(place)

In [None]:
m_3

Have a look at the linear models for the top 14 candidates by number of votes.

In [None]:
k = 0

fig, ax = plt.subplots(7,2,figsize=(20,28))
cand_lin_reg = LinearRegression()
full_res = full_data[full_data.columns[17:-1]].sum()

for cand in full_res.drop('url').sort_values(ascending=False).index[:14]:
    i = k // 2
    j = k % 2
    
    temp_data = full_data[full_data.oik == full_data[full_data[cand] > 0].oik.values[0]].reset_index()
    
    x = temp_data.turnout.fillna(0)
    y = (temp_data[cand] / temp_data['Число действительных избирательных бюллетеней']).fillna(0)
    
    data , x_e, y_e = np.histogram2d( x, y, bins = (100,100), density = True )
    z = interpn((0.5*(x_e[1:] + x_e[:-1]),
                0.5*(y_e[1:]+y_e[:-1])),
                data , np.vstack([x,y]).T,
                method = "splinef2d",
                bounds_error = False)
    z[np.where(np.isnan(z))] = 0.0
    idx = z.argsort()
    x, y, z = x[idx], y[idx], z[idx]
    
    ax[i,j].scatter( x, y, c=z)
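    # Reorder the vote-count weights with idx so that they stay aligned with the sorted x and y
    cand_lin_reg.fit(x.values.reshape(-1, 1), y, sample_weight=temp_data[cand].iloc[idx])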
    ax[i,j].plot(x.sort_values(), cand_lin_reg.predict(x.sort_values().values.reshape(-1, 1)), c='purple')
    ax[i,j].set_title(cand)
    ax[i,j].set_xlabel('Voter turnout')
    ax[i,j].set_ylabel('Votes Share')
    plt.subplots_adjust(hspace = .3)
    k += 1
    
plt.show();
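
The density-colouring step used in the scatter-plot cells can be isolated as follows. This is a sketch on synthetic points. The idea is to estimate a 2D density on a grid, interpolate it back onto each point, and sort so that the densest points are drawn last. The variable names are hypothetical.

```python
import numpy as np
from scipy.interpolate import interpn

rng = np.random.default_rng(0)
x = rng.normal(0.4, 0.05, 500)   # synthetic "turnout"
y = rng.normal(0.3, 0.05, 500)   # synthetic "vote share"

# 1. Density estimate on a regular grid
counts, x_edges, y_edges = np.histogram2d(x, y, bins=(20, 20), density=True)
x_centers = 0.5 * (x_edges[1:] + x_edges[:-1])
y_centers = 0.5 * (y_edges[1:] + y_edges[:-1])

# 2. Interpolate the grid density at every original point
z = interpn((x_centers, y_centers), counts, np.vstack([x, y]).T,
            method='splinef2d', bounds_error=False)
z[np.isnan(z)] = 0.0   # points outside the grid of bin centers get density 0

# 3. Sort by density so that dense points are plotted on top of sparse ones
order = z.argsort()
x_sorted, y_sorted, z_sorted = x[order], y[order], z[order]
```

Any array that must stay aligned with the points, such as regression weights, has to be reordered with the same `order` index.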

Now let's plot the distribution of turnout across polling stations, using a 1% step.


In [None]:
plt.figure(figsize=(17,7))
plt.hist(full_data.turnout.fillna(0) * 100, bins=100, color='limegreen')
plt.axvline(x=full_data.turnout.fillna(0).mean() * 100, ls='--', label='mean')
plt.axvline(x=full_data.turnout.fillna(0).median() * 100, c='black', ls='--', label='median')
plt.axvline(x=full_data.turnout.fillna(0).mode()[0] * 100, c='red', ls='--', label='mode')
plt.axvline(x=full_data.turnout[full_data.turnout != 1.0].mode()[0] * 100, c='yellow', ls='--', label='mode without 100%')
plt.title('Number of Polling places with given turnout across Moscow')
plt.xlabel('Voter turnout in %')
plt.ylabel('Number of Polling places')
plt.xticks(np.linspace(0, 100, 21))
plt.xlim(-1,101)
plt.legend()
plt.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.75)
plt.show()
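
The four vertical lines in the histogram come from simple Series statistics. A toy example with made-up turnout values shows why the mode is computed twice: a cluster of 100% commissions can dominate it.

```python
import pandas as pd

turnout = pd.Series([0.31, 0.35, 0.35, 0.42, 1.00, 1.00, 1.00])  # hypothetical values

mean = turnout.mean()
median = turnout.median()
mode_all = turnout.mode()[0]                          # most frequent value overall
mode_wo_full = turnout[turnout != 1.0].mode()[0]      # most frequent value, 100% excluded

print(mean, median, mode_all, mode_wo_full)
```

`mode()` returns a Series because several values can tie, and `[0]` takes the smallest of them.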

Do the polling stations with a high turnout have a real impact? Let's plot the number of voters at each turnout percentage, using a 0.1% step.

In [None]:
pick_df = full_data[['Число действительных избирательных бюллетеней', 'Число недействительных избирательных бюллетеней', 'turnout']]
pick_df = pick_df.sort_values('turnout')
pick_df['turnout'] = pick_df['turnout'].round(3)
pick_df = pick_df.groupby('turnout').sum()
pick_df['size'] = pick_df.sum(axis=1)
pick_df = pick_df.drop(['Число действительных избирательных бюллетеней', 'Число недействительных избирательных бюллетеней'], axis=1)

plt.figure(figsize=(13,10))
plt.plot(pick_df.index * 100, pick_df)
plt.title('Number of voters related to turnout')
plt.xlabel('Voter turnout in %')
plt.ylabel('Number of present voters')
plt.xticks(np.linspace(0, 100, 11), fontsize=10)
plt.xlim(-1,100)
plt.ylim(0)
plt.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.75)
plt.show()
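
The binning in the cell above amounts to rounding turnout to three decimals (a 0.1% step) and summing the ballots in each bin. Here is a compact sketch on toy data, where the column names `valid`, `invalid` and `turnout_bin` are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'valid':   [95, 190, 300, 50],
    'invalid': [5, 10, 0, 0],
    'turnout': [0.4001, 0.4004, 0.7512, 1.0],
})

# Rounding to 3 decimals puts the first two commissions into the same 40.0% bin
toy['turnout_bin'] = toy['turnout'].round(3)

# Voters who cast a ballot (valid + invalid), summed per turnout bin
present = (toy['valid'] + toy['invalid']).groupby(toy['turnout_bin']).sum()
print(present)
```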

In [None]:
ones = full_data[(full_data['Число действительных избирательных бюллетеней'] == 
                  full_data['Число избирателей, внесенных в список избирателей на момент окончания голосования'] )]

In [None]:
m_4 = folium.Map(
    location=[55.674093, 37.620407],
    zoom_start=11
)

In [None]:
ones.uik.values

In [None]:
mask = (uiks['isMoscow'] == 1)
for place in ones.uik.values:
    try:
        radius = 10
        folium.CircleMarker(radius=radius,
                      location=[uiks[(uiks.fixed_name == place) & mask]['lat'].values[0],
                                uiks[(uiks.fixed_name == place) & mask]['lon'].values[0]],
                      popup=place,
                      fill=True).add_to(m_4)
    except:
        print(place)

In [None]:
m_4

Do these commissions really have an important impact?

In [None]:
uiks[uiks.fixed_name.isin(ones.uik.values) & (uiks.isMoscow == 1)].address.values

In [None]:
ones['Число избирателей, внесенных в список избирателей на момент окончания голосования'].sum()
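
One way to quantify the impact is to compare the number of voters registered at these commissions with the city-wide total. The sketch below does so on made-up numbers; the short column names and the variable `share` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'uik':        ['УИК №1', 'УИК №2', 'УИК №3'],
    'valid':      [400, 30, 600],
    'registered': [1000, 30, 1500],
})

# Commissions where every registered voter cast a valid ballot
full = toy[toy['valid'] == toy['registered']]

# Fraction of all registered voters who belong to such commissions
share = full['registered'].sum() / toy['registered'].sum()
print(full['uik'].tolist(), round(share, 4))
```

A small share would suggest that such commissions, however striking on the map, contribute little to the overall totals.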