Version2 : Started August 22 - Ended August 24, 2024. 

Saved on August 18, 2024. By Marília Prata, mpwolke.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objs as go
import plotly.offline as py
import plotly.express as px

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


Absolute (directory/dirname) and relative paths

In [None]:
import os
dirname = os.path.dirname('../input/commune-in-switzerland/')
filename = os.path.join(dirname, 'p1_communes.csv')

That only builds a string with the path; the file itself still has to be read with pd.read_csv.

In [None]:
filename
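To illustrate the difference between the two kinds of path, here is a small sketch that builds the same relative path as above and converts it to an absolute one. Nothing is read from disk, so it does not matter whether the file exists; the variable name `absolute` is introduced only for this example.

```python
import os

# Relative path: interpreted from the current working directory
dirname = os.path.dirname('../input/commune-in-switzerland/')
filename = os.path.join(dirname, 'p1_communes.csv')

# Absolute path: the same location, spelled out from the filesystem root
absolute = os.path.abspath(filename)

print(os.path.isabs(filename))   # False
print(os.path.isabs(absolute))   # True
```

Either form can be passed to pd.read_csv; the relative one only works while the working directory stays the same.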

<h1 style="font-size:180%; color:orange;"><i><b>An Overview of the Dataset</b></i></h1>

#A1: Import the data as a Pandas Dataframe and name it as df.

In [None]:
df = pd.read_csv("../input/commune-in-switzerland/p1_communes.csv")
df.head()

#A2: Check the number of rows and columns.

In [None]:
df.info()
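df.info() prints the row and column counts together with the dtypes. When only the two numbers are needed, the shape attribute returns them as a tuple. A minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical three-row frame standing in for the communes data
toy = pd.DataFrame({
    "Commune": ["A", "B", "C"],
    "Canton": ["ZH", "BE", "ZH"],
    "Residents": [1200, 850, 4300],
})

n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")  # 3 rows, 3 columns
```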

Unique values: 26 Cantons ✔️ and 2202 Communes??

According to CH info: The 26 cantons are divided into 2131 communes.

https://www.ch-info.swiss/en/edition-2024/die-schweiz/foederalismus#:~:text=The%2026%20cantons%20are%20divided%20into%202131%20communes.

In [None]:
print ("\nUnique values :  \n",df.nunique())

#A3: Display the first few Entries of the DataFrame.

In [None]:
df.head()

#A4: Obtain the index labels, and then show the column names as a list.

"The columns method returns an Index object which contains all the column names. It can be converted to a list by using the list constructor or the tolist method."

Source: https://www.aporia.com/resources/how-to/get-list-dataframe-column-names/#:~:text=The%20columns%20method%20returns%20an,constructor%20or%20the%20tolist%20method.

In [None]:
# with list constructor
col_list = list(df.columns)
col_list

Listing the columns is useful for checking whether a name has stray spaces between its letters and the quote marks (leading or trailing whitespace).

In [None]:
df.columns.tolist()

#A5: Check the data type for each column.  Attribute dtypes

In [None]:
df.dtypes

#A6: Missing Values 

Instead of only counting them with df.isnull().sum(), show the rows that contain missing values, with their NaN entries:

df[df.isna().any(axis=1)]

In [None]:
#Show the rows that contain the missing values. 

df[df.isna().any(axis=1)]

#Mapping to fill Canton Nans with their respective Communes

The mapping is needed for B4: a missing Canton is a NaN, which is a float, and building a file name from it raises "TypeError: unsupported operand type(s) for +: 'float' and 'str'".

Apply the mapping to remove the NaN values from the Canton column.

In [None]:
#By Jocelyn Dumlao https://www.kaggle.com/discussions/questions-and-answers/529998#2968648

# Dictionary mapping Communes to their corresponding Cantons
commune_to_canton = {
     'Thurnen': 'BE',
     'Villaz': 'FR',
     'Prez': 'FR'
 }

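# Fill missing values in the 'Canton' column by looking up the row's 'Commune'
# (.get keeps the NaN when a commune is not in the dictionary instead of raising KeyError)
df['Canton'] = df.apply(
    lambda row: commune_to_canton.get(row['Commune'], row['Canton'])
    if pd.isnull(row['Canton']) else row['Canton'],
    axis=1)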

# Display the updated DataFrame
#print(df)
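The same fill can also be written without a row-wise apply, by combining map and fillna. This is only a sketch on made-up rows: the three commune/canton pairs come from the dictionary above, while the extra "Bern" row is an assumption added for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows; only the three commune/canton pairs come from the mapping above
toy = pd.DataFrame({
    "Commune": ["Thurnen", "Villaz", "Prez", "Bern"],
    "Canton": [np.nan, np.nan, np.nan, "BE"],
})
commune_to_canton = {"Thurnen": "BE", "Villaz": "FR", "Prez": "FR"}

# map() looks each commune up in the dictionary (NaN when absent);
# fillna() then uses those looked-up values only where Canton is missing
toy["Canton"] = toy["Canton"].fillna(toy["Commune"].map(commune_to_canton))

print(toy["Canton"].tolist())  # ['BE', 'FR', 'FR', 'BE']
```

Rows that already have a Canton are left untouched, so the existing values are never overwritten by the dictionary.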

Check that the Canton column no longer has NaN values for those Communes.

In [None]:
#Now, check the Canton without Nan! Only Language has Nan

df[df.isna().any(axis=1)]

#A7: Remove any observations to ensure that there are NO Missing Values. 

"You shouldn't impute data without a source or without using a sound **Statistical procedure** for the task." 

Therefore, avoid the simple: df.fillna(value=0,inplace=True)

In [None]:
#Apply SimpleImputer on the Missing values

from sklearn.impute import SimpleImputer
df_most_frequent = df.copy()
#setting strategy to 'most_frequent' to impute each column by its most frequent value
most_frequent_imputer = SimpleImputer(strategy='most_frequent')# strategy can also be mean or median (numeric columns only)
df_most_frequent.iloc[:,:] = most_frequent_imputer.fit_transform(df_most_frequent)

In [None]:
df_most_frequent.isnull().sum()
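Task A7 speaks of removing observations. If the goal is to drop the incomplete rows rather than impute them, DataFrame.dropna() is the usual call. A sketch on made-up rows (the frame `toy` and the name `cleaned` are assumptions for this example):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Commune": ["A", "B", "C"],
    "Language": ["de", np.nan, "fr"],
    "Residents": [1200, 850, 4300],
})

# Drop every row that has at least one missing value and renumber the index
cleaned = toy.dropna().reset_index(drop=True)

print(len(toy), "->", len(cleaned))   # 3 -> 2
print(cleaned.isna().sum().sum())     # 0
```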

<h1 style="font-size:180%; color:orange;"><i><b>Exploration: numerical summaries, indexing and grouping </b></i></h1>

#B1: Obtain the Mean, Minimum and Maximum value for each Column.

Method df.describe().loc[['mean','min','max']].T

Your output should preferably **show Only the three requested statistics** and not the full table of descriptive statistics.

In [None]:
#Describe showing Only the requested statistics (mean, minimum and maximum). Then, transpose the table.

df.describe().loc[['mean','min','max']].T

#B2: Ten Most Populated Communes

LIST the 10 MOST POPULATED COMMUNES, ordered by their NUMBER of RESIDENTS.

In [None]:
# Sort data by Residents in descending order 
most_populated = df.sort_values(by='Residents', ascending=False) 

# Display the 10 most populated communes 
most_populated.head(10)

#B3:Ten Least Populated Communes

LIST the 10 LEAST POPULATED COMMUNES, ordered by their NUMBER of RESIDENTS.

In [None]:
# Sort data by Residents in ascending order 
least_populated = df.sort_values(by='Residents', ascending=True) 

# Display the 10 least populated communes 
least_populated.head(10)
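As an alternative to sorting and then taking head(10), pandas offers nlargest and nsmallest, which return the same rows for a single numeric column. A small sketch with invented numbers:

```python
import pandas as pd

toy = pd.DataFrame({
    "Commune": ["A", "B", "C", "D"],
    "Residents": [1200, 850, 4300, 40],
})

# Same rows as sort_values(...).head(2), in descending / ascending order
top2 = toy.nlargest(2, "Residents")
bottom2 = toy.nsmallest(2, "Residents")

print(top2["Commune"].tolist())     # ['C', 'A']
print(bottom2["Commune"].tolist())  # ['D', 'B']
```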

#B4: Save into separate csv files: ZH.csv 

Group the communes by canton and save them into separate .csv files, e.g. a ZH.csv with all the data for communes in Zurich (Do not include the .csv files in your submission).

In [None]:
#Loop over the canton codes and write one file per canton, e.g. ZH.csv
for canton in df.Canton.unique():
    df[df.Canton == canton].to_csv(canton+'.csv')
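The same split can be sketched with groupby, which hands over each canton code together with its rows. The example below uses a made-up frame and writes into a temporary directory, so no files are left behind; the names `toy`, `out_dir` and `written` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "Commune": ["A", "B", "C"],
    "Canton": ["ZH", "BE", "ZH"],
    "Residents": [1200, 850, 4300],
})

# Write into a temporary directory so no files are left behind
with tempfile.TemporaryDirectory() as out_dir:
    for canton, group in toy.groupby("Canton"):
        group.to_csv(os.path.join(out_dir, f"{canton}.csv"), index=False)
    written = sorted(os.listdir(out_dir))

print(written)  # ['BE.csv', 'ZH.csv']
```

Note that groupby skips rows whose key is NaN by default, so communes with a missing Canton would be left out without any error; that is one more reason to fill the Canton column first.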

#B5: Compute the population density at the canton level. 

Rank the cantons from **most dense** to **least dense.**

In [None]:
df_total=df[["Residents", "Surface area in km²", "Canton"]].groupby(['Canton']).sum()
df_total["Population density"]=df_total["Residents"]/df_total["Surface area in km²"]
df_total.sort_values(by=['Population density'], ascending=False)
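The order of the operations matters here: the residents and the surface areas are summed per canton first, and only then divided. Averaging the densities of the individual communes would give a different number. A worked sketch with invented cantons "XX" and "YY":

```python
import pandas as pd

toy = pd.DataFrame({
    "Canton": ["XX", "XX", "YY"],
    "Residents": [9000, 1000, 500],
    "Surface area in km²": [10.0, 90.0, 50.0],
})

# Canton level: total residents divided by total surface area
totals = toy.groupby("Canton")[["Residents", "Surface area in km²"]].sum()
totals["Population density"] = totals["Residents"] / totals["Surface area in km²"]

# For comparison: the mean of the per-commune densities (hypothetical column 'density')
toy["density"] = toy["Residents"] / toy["Surface area in km²"]
mean_of_densities = toy.groupby("Canton")["density"].mean()

print(totals["Population density"]["XX"])  # 100.0
print(mean_of_densities["XX"])             # much larger, dominated by the small dense commune
```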

Most Dense Cantons

Basel is at the top with a population density of 5271.069012. The 2021 figures for Basel below suggest that the density has increased compared with the data above.

Population (July 2021)
 • Total: 201,156
 • Density: 5,400/km2 (14,000/sq mi)

Canton of Zürich (ZH) with a population of 1,553,423 (as of 31 December 2020), it is the most populous canton of Switzerland.

https://en.wikipedia.org/wiki/Canton_of_Zurich#cite_note-Stat2020_ZH-4

Canton of Zürich (ZH):
Population (December 2020)
 • Total: 1,553,423
 • Density: 900/km2 (2,300/sq mi)
 https://en.wikipedia.org/wiki/Canton_of_Zurich

**Least Dense Cantons**

The least dense canton is Grisons (population density per km²: 27.919509). The population of the canton (as of 31 December 2020) is 200,096,
which seems to be an increase compared with the Wikipedia/Eurostat figure from 2019.

Canton of Grisons (GR):

Population: 198,379 (2019) Eurostat

Area: 7,105 km²

https://en.wikipedia.org/wiki/Grisons

#B6: Communes in each Canton where more than 50 percent aged: 20-64 

Compute the number of communes in each canton where more than 50 percent of their populations are aged between 20 and 64 years old.

In [None]:
df[df['20-64 years'] > 50].groupby('Canton')['20-64 years'].count().sort_values()

Regarding B5: population density is directly proportional to the number of Residents and inversely proportional to the Surface area.

From the B5 table, Basel (BS Canton) has the highest density. Meanwhile, Grisons (GR Canton) is in 26th position, the lowest Population Density.

#B7: Difference Max/Min Elevations each Canton

Compute the difference between the maximum and minimum elevations for each canton. Find the top 5 cantons that have the largest range of elevations.

In [None]:
df.groupby('Canton')['Elevation'].agg(lambda x: x.max() - x.min()).sort_values(ascending=False).head(5)
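A variant without the lambda aggregates the minimum and the maximum once and subtracts the two columns, which also keeps both extremes visible for checking. A sketch with made-up elevations (the names `extremes` and `elevation_range` are introduced for this example):

```python
import pandas as pd

toy = pd.DataFrame({
    "Canton": ["VS", "VS", "ZH", "ZH"],
    "Elevation": [500, 3100, 400, 700],
})

# Aggregate both extremes once, then subtract the two columns
extremes = toy.groupby("Canton")["Elevation"].agg(["min", "max"])
elevation_range = (extremes["max"] - extremes["min"]).sort_values(ascending=False)

print(elevation_range.index.tolist())  # ['VS', 'ZH']
```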

<h1 style="font-size:180%; color:orange;"><i><b>Exploration: Visualizations</b></i></h1>

Top 10 Populated communes (Stacked Bar chart)

Horizontal Bar plot: The top 10 populated communes.

Your task is to obtain a horizontal bar plot that shows the top 10 populated communes. Your bar chart should have the names of the communes listed vertically along the y-axis and the x-axis should show the populations. Ensure that the chart has an appropriate title and labels.

In [None]:
top10= df.sort_values(by=['Residents'], ascending=False).head(10)

#C2: Horizontal Stacked Bar chart

Plot a horizontal stacked bar chart that shows how The Top 10 Densely Populated Communes lands are divided into the 4 Area types: Settlement, Agricultural, Wooded and Unproductive areas.

In [None]:
#Plotted first because the index is set to Commune for the bar plot further down
#After that, 'Commune' is the index and no longer a column, so x='Commune' would fail

top10[['Settlement area', 'Agricultural area','Wooded area','Unproductive area', 'Commune']].plot.barh(x='Commune',stacked=True)

plt.legend(loc=(1.04, 1))#StackOverFlow

# plot a horizontal stacked bar chart
#df.plot.barh(stacked=True)
plt.xlabel("Area Types in %")
plt.title('10 Densely Populated Communes by type of Area');
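The task refers to the most densely populated communes, whereas top10 above is ordered by Residents. One way to rank by density, assuming it is not already available as a column, is to derive it from Residents and Surface area in km² (the two columns used in B5). A sketch on made-up rows; the column name `density` is hypothetical and introduced only here:

```python
import pandas as pd

toy = pd.DataFrame({
    "Commune": ["A", "B", "C"],
    "Residents": [20000, 5000, 90000],
    "Surface area in km²": [2.0, 0.4, 90.0],
})

# Residents per km²; 'density' is a hypothetical helper column
toy["density"] = toy["Residents"] / toy["Surface area in km²"]

# The largest commune (C) is not among the densest ones
densest = toy.nlargest(2, "density")
print(densest["Commune"].tolist())  # ['B', 'A']
```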


OBSERVATIONS: 

* Settlement area (blue) takes up most of each bar, as expected in densely populated areas.

* Agricultural area (orange): because these top communes are so dense, the orange sub-bars make up only a small part of each bar.

Nevertheless, one-third of Switzerland's land is devoted to agricultural production and pasture (Source https://www.britannica.com/place/Switzerland/Agriculture-and-forestry)

* Wooded area (green): these sub-bars are also small, except in Onex.

In Onex, 12.1% of the total land area is heavily forested (https://en.wikipedia.org/wiki/Onex,_Switzerland)

According to https://www.vogelwarte.ch/ woodland (including shrub forest) covers about one third of the area of Switzerland.

* Unproductive area (red): the sub-bars are insignificant or absent altogether, since the values are 0.000000 in Massagno, Renens (VD), Chêne-Bourg and Prilly.

In [None]:
top10.head(3)

#C1: Horizontal Bar plot: The top 10 populated communes.

I swapped the order of the tasks because setting the index for the bar plot caused an error in the stacked chart. Therefore C2 (stacked bar chart) was delivered above.

**Invert the y-axis** so that the **most populated** commune is displayed **at the top**

In [None]:
#Set Commune as the index of top10, otherwise our barplot will have numbers on y-axis.

top10.set_index("Commune", inplace=True)

In [None]:
#The first 10 rows, and from these rows, we are interested in the values of the column 'Residents'
values = top10["Residents"]
colormap = plt.cm.tab10(range(0, len(values)))
bar_plot = values.plot.barh(color=colormap)

# Add titles, labels, invert y-axis

bar_plot.set_title("The top 10 populated communes")
bar_plot.set_xlabel("Populations")

#Invert the axis to have the descending order
bar_plot.invert_yaxis()
plt.show()

In [None]:
#Save for Next time

#StackOverflow plt legend: https://stackoverflow.com/questions/4700614/how-to-put-the-legend-outside-the-plot
#Answered by ImportanceOfBeingErnest answered Apr 16, 2017 at 16:04

# create a dataframe
#data = {'Commune': ["Genève", "Massagno", "Carouge (GE)", "Vevey","Basel","Renens (VD)","Lancy", "Chêne-Bourg","Onex","Prilly"],
        #'Settlement area': [92.133417,97.260274,84.905660,89.873418,86.521557, 94.237288, 88.016529, 88.095238, 77.304965,82.191781],
        #'Agricultural area': [1.510384, 0.000000, 4.905660, 2.953586, 3.850984, 3.389831, 5.165289, 8.730159, 6.028369, 10.958904],
        #'Wooded area': [3.146633, 2.739726, 8.679245, 4.641350,3.641691,2.372881, 6.404959, 3.174603, 13.120567, 6.849315],
        #'Unproductive area': [3.209566, 0.000000, 1.509434, 2.531646, 5.985768,0.000000, 0.413223, 0.000000, 3.546099, 0.000000]} 
#df = pd.DataFrame(data)
#df = df.set_index('Commune')
#df

# plot a horizontal stacked bar chart
#df.plot.barh(stacked=True)
#plt.xlabel("Area Types in %")
#plt.title('10 Densely Populated Communes by type of Area')

#plt.legend(loc=(1.04, 1))#StackOverFlow

#plt.show()

In [None]:
df1 = pd.read_csv("../input/commune-in-switzerland/p1_communes.csv")
df1.head(2)

In [None]:
#Make sure that the index is properly defined

#df1.set_index("Commune", inplace=True) 

#C3: Histogram of Young Residents (0-19 yrs)

Obtain a Histogram that shows the number of communes for which the proportion of young residents (0-19 years) falls into the intervals.

Pyplot: using percentage on x axis

In [None]:
##Answered by Jocelyn Dumlao https://www.kaggle.com/discussions/questions-and-answers/529789#2967862

#StackOverflow https://stackoverflow.com/questions/26294360/pyplot-using-percentage-on-x-axis
#Answered by Ffisegydd Oct 10, 2014 at 8:14

import matplotlib.ticker as mtick

fig, ax = plt.subplots(figsize=(5, 4))

# The values are already percentages, so no conversion (* 100) is applied
bins = [5, 10, 15, 20, 25, 30, 35]

df1["0-19 years"].plot.hist(bins=bins, color="green", edgecolor="black", ax=ax)
ax.set_xlabel("Percent of Young Residents (0-19 years)")
ax.set_ylabel("Switzerland Communes")
ax.set_title("Proportions of Young Residents (0-19 years)")

# Put a tick on every bin edge and append the percent sign to it
ax.set_xticks(bins)
ax.xaxis.set_major_formatter(mtick.PercentFormatter())
ax.tick_params(axis="x", labelrotation=45)

plt.show()
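To read the bar heights as numbers, the same intervals can be counted with pd.cut. This is a sketch on invented percentages, not on the dataset; right=False is chosen so that each interval is closed on the left, which matches how the histogram assigns values except at the very last edge.

```python
import pandas as pd

# Made-up percentages of young residents, one value per commune
young = pd.Series([7.5, 12.0, 18.3, 21.0, 22.4, 26.9, 31.2])
bins = [5, 10, 15, 20, 25, 30, 35]

# right=False makes each interval closed on the left: [5, 10), [10, 15), ...
counts = pd.cut(young, bins=bins, right=False).value_counts(sort=False)

print(counts.tolist())  # [1, 1, 1, 2, 1, 1]
```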

#C4: Violin plot: Spoken languages by the young population

A violin plot of the young resident population (0-19 yrs) by spoken language.

German (Deutsch) is the most spoken language and Romansh the least. The Zero corresponds to the "Canton Zero" (missing values filled with fillna).

According to https://www.eda.admin.ch/ Languages and dialects:

"The four official national languages – German, French, Italian, and Romansh – are spoken in dialects that often differ significantly from their written counterparts." 
https://www.eda.admin.ch/aboutswitzerland/en/home/gesellschaft/sprachen/sprachen-und-dialekte.html

In [None]:
#By Daniel Flores Agreda since I delivered a single violin instead of subplots

fig, axes = plt.subplots(nrows = 1,
                         ncols = 3,
                         figsize = (17, 4), sharey=True)   

sns.violinplot(x="Language", y="0-19 years",   data=df1, ax=axes[0])
sns.boxplot(x="Language", y="20-64 years",     data=df1, ax=axes[1])
sns.boxplot(x="Language", y="65 years or over",data=df1, ax=axes[2])

axes[0].set_title('distribution of the age group 0-19 for language regions')
axes[1].set_title('distribution of the age group 20-64 for language regions')
axes[2].set_title('distribution of the age group 65 years or over for language regions')

fig.tight_layout()
plt.show()

#C5: Strip plot Subplots age groups 0-19 years, 20-64 years and 65 years or over.

Task for the three age groups 0-19 years, 20-64 years, and 65 years or over. In order to make the comparison easy, you should make a plot with one subplot per age group (plot with 1 row and 3 columns). Ensure that the subplots have appropriate titles, legends, and labels. Also, ensure that they have proper sizes and there is enough space between them.

The percent symbol is added to the y-axis of each subplot with PercentFormatter; it has to be set on the subplot's own axes, not on an axes object left over from an earlier figure.

The 20-64 years group has the highest proportions (from about 50% to 80%) on both the strip plot and the box plot.

In [None]:
#Creating a stripplot to visualize differences in data distribution between languages spoken

import matplotlib.ticker as mtick

features = ['0-19 years', '20-64 years', '65 years or over']

sns.set_style('darkgrid')
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

for ax, feature in zip(axes, features):
    sns.stripplot(x='Language', y=feature, data=df1, palette='summer', ax=ax)
    ax.set_xlabel('Languages Spoken')
    ax.set_ylabel('Population Percentages')
    ax.set_title(f'{feature} Strip plot')
    #The values are already percentages, so only the % sign is appended to the ticks
    ax.yaxis.set_major_formatter(mtick.PercentFormatter())

fig.tight_layout()
plt.show()

In [None]:
#By Fatih Bilgin https://www.kaggle.com/code/fatihbilgin/quick-eda-and-data-visualization-for-beginners

import plotly.graph_objs as go
from plotly.offline import iplot
import plotly.express as px

trace1 = go.Box(
    y=df1["0-19 years"],
    name = 'Population 0-19 years',
    marker = dict(color = 'rgb(255,111,145)')
)

trace2 = go.Box(
    y=df1["20-64 years"],
    name = 'Population 20-64 years',
    marker = dict(color = 'rgb(132,94,194)')
)

trace3 = go.Box(
    y=df1["65 years or over"],
    name = 'Population 65 years or over',
    marker = dict(color = 'rgb(255,111,145)')
)

data = [trace1, trace2, trace3]
layout = dict(autosize=False, width=700,height=500, title='Communes Population 0-65 or over yrs.', paper_bgcolor='rgb(243, 243, 243)', 
              plot_bgcolor='rgb(243, 243, 243)', margin=dict(l=40,r=30,b=80,t=100,))

fig = dict(data=data, layout=layout)
iplot(fig)

#C6: Three Pairplots

Produce 3 plots to visually investigate the relation between the Agricultural area of communes and their Settlement area, Wooded area and the Unproductive area.

**First pairplot**: Agricultural and Settlement areas. The relationship looks roughly linear, almost a sloped line.
The densely populated areas are those where the Elevation is at most 500.

Source for the future: https://stats.stackexchange.com/questions/636112/how-to-interpret-pairplots

In [None]:
#Relation between the Agricultural area of communes and their Settlement area

sns.pairplot(df1, x_vars=["Agricultural area"], 
                   y_vars=["Settlement area"], 
                   hue="Elevation", 
                   height=3)

plt.show()

**Second Pairplot**: Wooded and Unproductive areas.

In the second pairplot, the higher the Elevation, the larger the unproductive area.
The relationship also seems linear.

In [None]:
sns.pairplot(df1, x_vars=["Wooded area"], 
                   y_vars=["Unproductive area"], 
                   hue="Elevation", 
                   height=3)
plt.show()

The **third Pairplot** uses only the communes that are **located at high altitudes**.

In [None]:
#Select all the Elevations that are over 2000m. The Boolean condition is then 'Elevation' > 2000. 

#df["Elevation"] > 2000

df1[df1["Elevation"] > 2000].head(2)

In [None]:
#elev: Communes where elevation is higher than 2000

elev = df1[df1["Elevation"] > 2000]


**Third Pairplot**

There seems to be very little correlation between Settlement, Agricultural and Unproductive areas.
I cannot see linearity among these features (no straight lines).

In [None]:
#Third Pair plot

sns.pairplot(elev, x_vars=["Agricultural area", "Unproductive area"], 
                   y_vars=["Settlement area"], 
                   hue="Elevation", 
                   height=3)
plt.show()

#C7: Switzerland Maps

Map of Switzerland using the East and North coordinates of communes. We can see the three geographic regions, namely the Swiss Alps, the Central Plateau and the Jura.

In [None]:
#By Marília Prata on Kaggle https://www.kaggle.com/code/mpwolke/airports-maps
#StackOverFlow https://stackoverflow.com/questions/25328003/how-can-i-change-the-font-size-using-seaborn-facetgrid

sns.set(font_scale=3) 

plt.figure(figsize=(20,12))
ax = plt.gca()
ax.set_title("Switzerland Geographic Regions")

g = sns.scatterplot(x='East coordinate', y='North coordinate', data=df1, hue='Elevation')
g.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1);

On the map by Elevation, we can see a concentration of communes in the north of the country, where the altitude is lower.

Below, it is noticeable that German (Deutsch) is spoken mostly in the north, French on the left of the map (near Lake Geneva and the French border) and Italian near Italy (Italy–Switzerland border).

The Zero/Null language is a consequence of the fillna (Prez, Villaz and Thurnen Communes).

In [None]:
#By Marília Prata on Kaggle https://www.kaggle.com/code/mpwolke/airports-maps
#StackOverFlow https://stackoverflow.com/questions/25328003/how-can-i-change-the-font-size-using-seaborn-facetgrid

sns.set(font_scale=3) 

plt.figure(figsize=(20,12))
ax = plt.gca()
ax.set_title("Switzerland by Spoken Languages")

g = sns.scatterplot(x='East coordinate', y='North coordinate', data=df1, hue='Language')
g.legend(loc='center left', bbox_to_anchor=(1.25, 0.5), ncol=1);

**Switzerland Map subplots**

Two plots from the previous task as subplots (with 1 row and 2 columns). 

In [None]:
sns.set(font_scale=1)

fig, axes = plt.subplots(nrows = 1, # sets the number of rows in the subplot grid
                         ncols = 2, # sets the number of columns in the subplot grid
                         figsize = (10, 5), # sets the width and height of the figure (in inches)
                         sharey=True) # let the two plots share the same y-axis

# the left-hand side scatter plot
sns.scatterplot(data=df1, x="East coordinate", y="North coordinate", hue='Elevation', 
                ax=axes[0])   # draw the plot on the first axes


axes[0].set_title('Switzerland by Geographic Regions')
axes[0].legend(loc='upper right', fontsize=8) # set the legend location
axes[0].set_xlabel('East coordinate')
#axes[1].set_ylabel('North coordinate')

# the right-hand side scatter plot
sns.scatterplot(data=df1, x="East coordinate", y="North coordinate", hue='Language', 
                ax=axes[1],   # draw the plot on the second axes
                legend=True) #Attention to turn on/off the legend
    
axes[1].set_title('Switzerland by Spoken Languages')
axes[1].legend(loc='upper right', fontsize=8) # set the legend location
axes[1].set_xlabel('East coordinate')
#axes[1].set_ylabel('North coordinate')

fig.tight_layout()
plt.show()

<h1 style="font-size:180%; color:orange;"><i><b>Probabilities</b></i></h1>

In [None]:
#We have 98 entries/rows for Elevations higher than 2000

len(elev)
elev.shape

In [None]:
#Valais Canton where elevation is higher than 2000. Entries for VS Valais Canton higher than 2000m : 49 

VS_elev = elev[(elev['Canton']=='VS')].reset_index(drop=True)
VS_elev.tail(2)

In [None]:
#Entries/rows for VS Valais Canton Elevation higher than 2000m : 49

VS_elev.shape

#D Probabilities

Compute the **probability that a randomly selected commune with elevation over 2000** is from the Canton of **Valais**.

49 entries is exactly 50% of 98 entries (Total Elevations higher than 2000). 

In [None]:
#By Matin Mahmoudi  https://www.kaggle.com/discussions/general/529232#2969415

# Number of communes with elevation > 2000 (98 rows in elev)
total_elev_communes = len(elev)

# Number of communes in VS with elevation > 2000 (49 rows in VS_elev)
vs_elev_communes = len(VS_elev)

# Calculate the conditional probability P(VS | elevation > 2000)
probability = vs_elev_communes / total_elev_communes

print(f"Estimated Probability: {probability:.2f}")
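The same ratio can be computed directly from two Boolean masks, without storing intermediate frames. The sketch below uses invented rows, so the result is not the one from the dataset; the names `high` and `in_vs` are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Canton": ["VS", "VS", "GR", "BE", "VS"],
    "Elevation": [2500, 2100, 2300, 2200, 600],
})

high = toy["Elevation"] > 2000
in_vs = toy["Canton"] == "VS"

# P(VS | high) = P(VS and high) / P(high) = count(VS and high) / count(high)
p_vs_given_high = (high & in_vs).sum() / high.sum()

print(p_vs_given_high)  # 0.5
```

The VS commune at 600 m does not enter the calculation at all, because the condition restricts the sample to the high communes first.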

By Matin Mahmoudi [Learn Bayes' Theorem Thoroughly: A Lifelong Understanding](https://www.kaggle.com/discussions/general/529232#2969415)

![image.png](attachment:53f3cd84-4c1d-4f08-8bb4-7e479e9b19a5.png)

<h1 style="font-size:180%; color:orange;"><i><b>Matrices</b></i></h1>

Tip by Teacher Daniel Flores Agueda to apply Crosstab.

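However, the snippet returned NameError: name 'cleaned_df' is not defined, because cleaned_df (and df2) were never created in this notebook; the DataFrame here is called df.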

In [None]:
#Alternative approaches include:
#pd.crosstab(df2['Commune'], df2['Canton'])
#pd.get_dummies(cleaned_df.Canton)
#(cleaned_df[['Canton']].values == cleaned_df.Canton.unique()).astype(int)

#That returned NameError: name 'cleaned_df' is not defined
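To see what two of these alternatives produce, here is a sketch on a made-up frame named `toy` (an assumption; in the notebook the frame would be df). Both give one row per commune and one column per canton, with a 1 where they match.

```python
import pandas as pd

toy = pd.DataFrame({
    "Commune": ["A", "B", "C"],
    "Canton": ["ZH", "BE", "ZH"],
})

# One row per commune, one column per canton, 1 where they match
by_crosstab = pd.crosstab(toy["Commune"], toy["Canton"])
by_dummies = pd.get_dummies(toy.set_index("Commune")["Canton"]).astype(int)

print(by_crosstab.loc["A", "ZH"], by_crosstab.loc["B", "ZH"])  # 1 0
print((by_crosstab.values == by_dummies.values).all())          # True
```

One difference to keep in mind: crosstab counts occurrences and merges rows with the same commune name, while get_dummies keeps one row per original row.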

In [None]:
#Number of Communes

#https://saturncloud.io/blog/how-to-get-unique-values-in-multiple-columns-using-pandas/#:~:text=Getting%20Unique%20Values%20in%20Multiple%20Columns,-Now%20that%20we&text=We%20can%20use%20the%20drop_duplicates,Pandas%20DataFrame%20to%20achieve%20this.&text=The%20drop_duplicates()%20method%20removes,the%20Age%20and%20City%20columns.

unique_communes = df['Commune'].unique()

print(unique_communes)
len(unique_communes)

According to Wikipedia: "There are 2,131 municipalities as of January 2024."

In [None]:
#Switzerland has 26 Cantons

unique_cantons = df['Canton'].unique()

print(unique_cantons)
len(unique_cantons)

#E Matrices

Define a data frame matrix whose rows correspond to communes and the columns to the cantons. Fill in the matrix with 0/1 values where entry (i,j) is a 1 if the commune in row i is in the canton in column j and a 0 otherwise.

In [None]:
#Starting point: https://www.kaggle.com/discussions/questions-and-answers/528667#2961796
#Answered by Metin Meki Abullrahman (an array of zeros and ones of the right size)
#A constant array does not encode which commune is in which canton, so the 0/1
#entries are built with pd.get_dummies, one of the alternatives listed above

#One row per commune, one column per canton; entry (i, j) is 1 if commune i is in canton j
matrix = pd.get_dummies(df.set_index('Commune')['Canton']).astype(int)

#Each commune belongs to one canton, so every row should sum to 1
print(matrix.sum(axis=1).value_counts())

#Display the shape and the first rows (a new name is used so that df is not overwritten)
print(matrix.shape)
matrix.head()

That's all for the first list of tasks of the course Introduction to data analysis with Python. It was hard for a beginner to deal with so many snippets and to handle the data again after each re-indexing.

No pain, no gain. That's it!