# Corruption Perceptions Index Analysis

This notebook analyzes Corruption Perceptions Index (CPI) data to identify the countries perceived as most and least corrupt, visualize the score distribution, and cluster countries by CPI score.

## 1. Loading and Exploring the Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans

df = pd.read_csv('corruption_data.csv')
print(df.head())
df.info()  # info() prints its summary directly and returns None, so no print() is needed

## 2. Most and Least Corrupt Countries

The CPI score ranges from 0 (highly corrupt) to 100 (very clean), so the lowest scores mark the countries perceived as most corrupt.
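Since the direction of the scale is easy to get backwards, a quick sanity check before ranking can help. The following is a minimal sketch on made-up rows (not real CPI data); it assumes only the `Country` and `CPI Score 2023` column names used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample rows for illustration only, not real CPI data.
sample = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'CPI Score 2023': [90, 12, 55, 33],
})

scores = sample['CPI Score 2023']

# Every score should fall on the 0-100 scale.
in_range = bool(scores.between(0, 100).all())

# Lowest score = most corrupt; highest score = least corrupt.
most_corrupt_country = sample.loc[scores.idxmin(), 'Country']
least_corrupt_country = sample.loc[scores.idxmax(), 'Country']

print(in_range, most_corrupt_country, least_corrupt_country)
```

The same check could be run on `df` before sorting, to confirm that the column holds valid scores.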

In [None]:
most_corrupt = df.sort_values(by='CPI Score 2023', ascending=True).head(10)
least_corrupt = df.sort_values(by='CPI Score 2023', ascending=False).head(10)

print("\nMost Corrupt Countries (Lowest CPI Score):\n")
print(most_corrupt[['Country', 'CPI Score 2023']])

print("\nLeast Corrupt Countries (Highest CPI Score):\n")
print(least_corrupt[['Country', 'CPI Score 2023']])

## 3. Data Visualization

### 3.1 Top 10 Least Corrupt Countries

In [None]:
plt.figure(figsize=(12, 7))
sns.barplot(x='CPI Score 2023', y='Country', data=least_corrupt, palette='viridis')
plt.title('Top 10 Least Corrupt Countries (Highest CPI Score)')
plt.xlabel('CPI Score 2023')
plt.ylabel('Country')
plt.xlim(0, 100)
plt.show()

### 3.2 Top 10 Most Corrupt Countries

In [None]:
plt.figure(figsize=(12, 7))
sns.barplot(x='CPI Score 2023', y='Country', data=most_corrupt, palette='magma')
plt.title('Top 10 Most Corrupt Countries (Lowest CPI Score)')
plt.xlabel('CPI Score 2023')
plt.ylabel('Country')
plt.xlim(0, 100)
plt.show()

### 3.3 Distribution of CPI Scores

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df['CPI Score 2023'], bins=10, kde=True)
plt.title('Distribution of CPI Scores')
plt.xlabel('CPI Score 2023')
plt.ylabel('Number of Countries')
plt.show()

## 4. Machine Learning: Clustering Countries by Corruption Levels

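We use K-Means clustering to group countries by CPI score, with 3 clusters intended to represent low, medium, and high corruption levels. The integer labels K-Means assigns are arbitrary IDs and carry no order, so which label corresponds to which level has to be read from the cluster centroids.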
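To make the idea concrete, here is a plain-Python sketch of the basic K-Means loop (Lloyd's algorithm) on a single feature. It is an illustration only, not scikit-learn's implementation: the initialization is a simplified deterministic choice made for this example, and the scores are made up.

```python
def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd's algorithm on a list of numbers (illustration only)."""
    ordered = sorted(values)
    # Assumed deterministic start: spread initial centroids across the sorted values.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Made-up scores, not real CPI data.
scores = [88, 85, 90, 52, 48, 55, 15, 20, 12]
labels, centroids = kmeans_1d(scores, k=3)
print(labels)
print(centroids)
```

With one feature, the clusters are contiguous score ranges, which is why three clusters can be read as three corruption bands.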

In [None]:
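# Use the CPI score as the only feature; double brackets keep X two-dimensional
X = df[['CPI Score 2023']]
# A fixed random_state makes the centroid initialization reproducible
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
# Cluster labels (0, 1, 2) are arbitrary IDs, not a corruption ranking
df['Cluster'] = kmeans.fit_predict(X)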

# Sort by CPI score to see the clusters clearly
df_sorted = df.sort_values(by='CPI Score 2023', ascending=False)
print(df_sorted[['Country', 'CPI Score 2023', 'Cluster']].head(15))
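Because K-Means assigns cluster IDs arbitrarily, the printed `Cluster` column does not by itself say which group is "high corruption". One possible way to attach ordered names is to rank the clusters by centroid. The sketch below uses a hypothetical helper, `name_clusters`, with made-up labels and centroids.

```python
def name_clusters(labels, centers,
                  names=('High corruption', 'Medium corruption', 'Low corruption')):
    """Map arbitrary cluster IDs to names ordered by centroid score (lowest first)."""
    # Rank cluster IDs by centroid: the lowest mean CPI score comes first.
    order = sorted(range(len(centers)), key=lambda c: centers[c])
    id_to_name = {cluster_id: names[rank] for rank, cluster_id in enumerate(order)}
    return [id_to_name[label] for label in labels]


# Made-up labels and centroids, not real CPI output.
example_labels = [1, 0, 2, 1]
example_centers = [22.0, 81.5, 47.0]  # cluster 0 has the lowest mean score
named = name_clusters(example_labels, example_centers)
print(named)
```

In this notebook, a call along the lines of `name_clusters(df['Cluster'], kmeans.cluster_centers_.ravel())` should give one name per row. The result could be stored in a new column and passed as `hue` in the plot below.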

### 4.1 Visualizing the Clusters

In [None]:
plt.figure(figsize=(12, 8))
sns.scatterplot(x='CPI Score 2023', y='Country', hue='Cluster', data=df_sorted, palette='coolwarm', s=100)
plt.title('Countries Clustered by CPI Score')
plt.xlabel('CPI Score 2023')
plt.ylabel('Country')
plt.legend(title='Corruption Cluster')
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.show()