**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Adey Ali
- Makeil Ali
- Ayat Alwazir
- Asaki Billawala
- Mohammed Zaid

# Research Question

To what extent does political corruption, measured by indices such as the Corruption Perceptions Index, impact the long-term performance of national football teams in international tournaments, and how does institutional strength correlate with sustained competitive success when accounting for economic factors like GDP per capita and social factors like Press Freedom Index?

## Background and Prior Work


- Include a general introduction to your topic
- Include explanation of what work has been done previously
- Include citations or links to previous work

This section will present the background and context of your topic and question in a few paragraphs. Include a general introduction to your topic and then describe what information you currently know about the topic after doing your initial research. Include references to other projects that have asked similar questions or approached similar problems. Explain what others have learned in their projects.

Find some relevant prior work, and reference those sources, summarizing what each did and what they learned. Even if you think you have a totally novel question, find the most similar prior work that you can and discuss how it relates to your project.

References can be research publications, but they need not be. Blogs, GitHub repositories, company websites, etc., are all viable references if they are relevant to your project. It must be clear which information comes from which references. (2-3 paragraphs, including at least 2 references)

 **Use inline citation through HTML footnotes to specify which references support which statements** 

For example: After government genocide in the 20th century, real birds were replaced with surveillance drones designed to look just like birds.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Use a minimum of 2 or 3 citations, but we prefer more.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) You need enough to fully explain and back up important facts. 

Note that if you click a footnote number in the paragraph above it will transport you to the proper entry in the footnotes list below.  And if you click the ^ in the footnote entry, it will return you to the place in the main text where the footnote is made.

To understand the HTML here, `<a name="#..."> </a>` is a tag that allows you to produce a named reference for a given location. Markdown has the construction `[text with hyperlink](#named reference)`, which produces a clickable link that transports you to the named reference.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.


# Hypothesis



- Include your team's hypothesis
- Ensure that this hypothesis is clear to readers
- Explain why you think this will be the outcome (what was your thinking?)

What is your main hypothesis/predictions about what the answer to your question is? Briefly explain your thinking. (2-3 sentences)

# Data

## Data overview

For each dataset include the following information
- Press Freedom Index
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description:
- Corruption Perception Index
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description: The Corruption Perceptions Index (CPI) dataset ranks countries based on perceived levels of public sector corruption, using a score from 0 to 100 where higher scores indicate cleaner governance. Each entry includes a country name, score, rank, and score change from the previous year, all of which serve as proxies for transparency and institutional integrity. A time series component tracks score changes from 2012 to 2024, enabling analysis of trends across different political and economic contexts. Each country's score is derived from at least three of thirteen expert and businessperson surveys from trusted sources such as the World Bank. Preprocessing steps include converting scores and ranks to numeric types, standardizing country names, handling missing values, and structuring the time series for longitudinal analysis.
- FIFA Men's Ranking
  - Dataset Name:
  - Link to the dataset: https://www.kaggle.com/datasets/cashncarry/fifaworldranking/code
  - Number of observations:
  - Number of variables: 
  - Description: The FIFA World Rankings dataset includes both men's and women's national football teams, ranked by total points calculated from recent match results, tournament performance, and strength of opponents. Each entry contains key variables such as rank, team name, and total points, which serve as proxies for competitive strength and consistency. The men's rankings are updated as of April 3, 2025, and the women's as of March 6, 2025, offering a snapshot of current international standings. Cleaning the data requires dropping the statistics we are not analyzing, keeping only the men's rankings, and reshaping the data into a cleaner table. This dataset can be valuable for performance modeling, regional comparisons, or assessing the impact of global tournaments on rankings.
- Gross Domestic Product (GDP) per capita
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description: The World Bank GDP per capita dataset includes both a global time series from 1960 to 2023 and the most recent GDP per capita values for individual countries. The key variables are year, country name, and GDP per capita, expressed in constant 2015 US dollars to account for inflation and enable accurate comparisons over time. These metrics serve as proxies for average individual economic output and overall development, providing insight into both global growth trends and national income disparities. Preprocessing would involve converting values to consistent numeric types, standardizing country names, handling missing data, and merging with population or regional indicators if needed. Together, these datasets offer powerful tools for analyzing long-term economic progress and cross-country differences in living standards.
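
The preprocessing steps named in the descriptions above (numeric conversion, country-name standardization, missing-value handling, and time-series restructuring) could look roughly like the sketch below. The input table, the `NAME_FIXES` mapping, and the `tidy_indicator` helper are all hypothetical stand-ins with made-up values, not the actual CPI or GDP files.

```python
import pandas as pd

# Hypothetical wide-format table standing in for a CPI or GDP export:
# one row per country, one column per year. All values are made up.
raw = pd.DataFrame({
    "Country": [" Korea, Rep. ", "Brazil", "Côte d'Ivoire"],
    "2022": ["63", "38", ".."],
    "2023": ["63", "36", "40"],
})

# Assumed mapping from source-specific names to one shared spelling.
NAME_FIXES = {"Korea, Rep.": "South Korea", "Côte d'Ivoire": "Ivory Coast"}

def tidy_indicator(df, value_name):
    """Return a long table with columns country, year, <value_name>."""
    long = df.melt(id_vars="Country", var_name="year", value_name=value_name)
    long = long.rename(columns={"Country": "country"})
    # Trim stray whitespace, then map known variants to the shared name.
    long["country"] = long["country"].str.strip().replace(NAME_FIXES)
    long["year"] = long["year"].astype(int)
    # Non-numeric placeholders such as ".." become NaN instead of raising.
    long[value_name] = pd.to_numeric(long[value_name], errors="coerce")
    return long.dropna(subset=[value_name]).reset_index(drop=True)

cpi_long = tidy_indicator(raw, "cpi_score")
print(cpi_long)
```

Dropping rows with missing values is only one option. Keeping them as NaN may be preferable if gaps in coverage are themselves of interest.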

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.
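
One possible way to combine the datasets is to bring each into a long `country`/`year` layout and left-join the indicators onto the FIFA rankings. This is a minimal sketch, assuming such long tables already exist. The country names and every value below are placeholders, not real data.

```python
import pandas as pd

# Placeholder long-format tables (country, year, value). All values are made up.
fifa = pd.DataFrame({"country": ["Country A", "Country A", "Country B"],
                     "year": [2022, 2023, 2023],
                     "rank": [10, 12, 80]})
cpi = pd.DataFrame({"country": ["Country A", "Country A"],
                    "year": [2022, 2023],
                    "cpi_score": [55, 57]})
gdp = pd.DataFrame({"country": ["Country A", "Country A"],
                    "year": [2022, 2023],
                    "gdp_per_capita": [1000.0, 1100.0]})

keys = ["country", "year"]

# indicator=True adds a _merge column that flags rows with no CPI match,
# which helps spot country-name mismatches between sources.
merged = fifa.merge(cpi, on=keys, how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", keys]
print("FIFA rows without a CPI match:")
print(unmatched)

# Drop the flag column and attach GDP per capita the same way.
merged = merged.drop(columns="_merge").merge(gdp, on=keys, how="left")
print(merged)
```

A left join keeps every FIFA row, so teams without indicator data stay visible as NaN rather than silently disappearing. An inner join would be the alternative if only fully matched rows are wanted.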

## Setup

In [2]:
import pandas as pd
import numpy as np

## Press Freedom Index

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION

## Corruption Perception Index

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

## FIFA Men's Ranking

In [4]:
# Load the FIFA ranking history from the published Google Sheets CSV
fifa_rankings = pd.read_csv('https://docs.google.com/spreadsheets/d/e/2PACX-1vR1-ErfxE_SzHrtVCJC3IPzZdobPJHgxxhkEdz7Crn63LhcSQAXHnXUkeFl3tOFfd6P-ZPUef1hQyca/pub?gid=1330103511&single=true&output=csv')
# Drop the columns we are not analyzing
fifa_rankings = fifa_rankings.drop(columns=['country_abrv', 'rank_change', 'total_points', 'confederation', 'previous_points'])

# Keep one ranking date per year (a December release for 1992-2023; 2024 uses 2024-06-20)
dates_keep = [
    '1992-12-31', '1993-12-23', '1994-12-20', '1995-12-19', '1996-12-18', '1997-12-23',
    '1998-12-23', '1999-12-22', '2000-12-20', '2001-12-19', '2002-12-18', '2003-12-15',
    '2004-12-20', '2005-12-16', '2006-12-18', '2007-12-17', '2008-12-17', '2009-12-16',
    '2010-12-15', '2011-12-21', '2012-12-19', '2013-12-19', '2014-12-18', '2015-12-03',
    '2016-12-22', '2017-12-21', '2018-12-20', '2019-12-19', '2020-12-10', '2021-12-23',
    '2022-12-22', '2023-12-21', '2024-06-20',
]
# .copy() avoids a SettingWithCopyWarning when new columns are added below
fifa_rankings = fifa_rankings[fifa_rankings['rank_date'].isin(dates_keep)].copy()
fifa_rankings = fifa_rankings.dropna(axis=1, how='all')

# Reshape to one row per country and one column per year
fifa_rankings['rank_date'] = pd.to_datetime(fifa_rankings['rank_date'])
fifa_rankings['year'] = fifa_rankings['rank_date'].dt.year
fifa_rankings = fifa_rankings.rename(columns={'country_full': 'country'})
fifa_rankings = fifa_rankings.pivot(index='country', columns='year', values='rank')
fifa_rankings



| country | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001 | ... | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 150.0 | 146.0 | 148.0 | 147.0 | 149.0 | 150.0 | 149.0 | 155.0 | 158.0 | 151.0 |
| Albania | 86.0 | 92.0 | 100.0 | 91.0 | 116.0 | 116.0 | 106.0 | 83.0 | 72.0 | 96.0 | ... | 38.0 | 49.0 | 62.0 | 60.0 | 66.0 | 66.0 | 66.0 | 66.0 | 62.0 | 66.0 |
| Algeria | 30.0 | 35.0 | 57.0 | 48.0 | 49.0 | 59.0 | 71.0 | 86.0 | 82.0 | 75.0 | ... | 28.0 | 38.0 | 58.0 | 67.0 | 35.0 | 31.0 | 29.0 | 40.0 | 30.0 | 44.0 |
| American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | 193.0 | 199.0 | 203.0 | 201.0 | ... | 167.0 | 191.0 | 194.0 | 190.0 | 192.0 | 192.0 | 190.0 | 188.0 | 188.0 | 188.0 |
| Andorra | NaN | NaN | NaN | NaN | 187.0 | 185.0 | 171.0 | 145.0 | 145.0 | 140.0 | ... | 201.0 | 203.0 | 138.0 | 133.0 | 135.0 | 151.0 | 155.0 | 153.0 | 164.0 | 162.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Yemen | 115.0 | 91.0 | 103.0 | 123.0 | 139.0 | 128.0 | 146.0 | 158.0 | 160.0 | 135.0 | ... | 174.0 | 148.0 | 120.0 | 135.0 | 144.0 | 145.0 | 151.0 | 154.0 | 152.0 | 155.0 |
| Yugoslavia | 29.0 | 68.0 | 101.0 | 81.0 | 55.0 | 20.0 | 6.0 | 13.0 | 9.0 | 11.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Zaire | 60.0 | 71.0 | 68.0 | 68.0 | 66.0 | 76.0 | 62.0 | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Zambia | 32.0 | 27.0 | 21.0 | 25.0 | 20.0 | 21.0 | 29.0 | 36.0 | 49.0 | 64.0 | ... | 73.0 | 88.0 | 74.0 | 83.0 | 88.0 | 90.0 | 88.0 | 88.0 | 84.0 | 90.0 |
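
The wide country-by-year table is easy to read but awkward to join with per-year indicators. The sketch below shows one way to turn it back into long form and summarize long-term performance. The small `wide` frame copies three rows and three years from the output above. Using the mean rank as the long-term measure is an assumption made for illustration, not a decision the team has stated.

```python
import pandas as pd

# Small stand-in for the pivoted table: countries as the index, years as
# columns, NaN where a team has no ranking for that year.
wide = pd.DataFrame(
    {2022: [66.0, 40.0, None], 2023: [62.0, 30.0, None], 2024: [66.0, 44.0, None]},
    index=pd.Index(["Albania", "Algeria", "Zaire"], name="country"),
)
wide.columns.name = "year"

# melt() turns the year columns into rows; unranked years are dropped.
long = (
    wide.reset_index()
        .melt(id_vars="country", var_name="year", value_name="rank")
        .dropna(subset=["rank"])
)
long["year"] = long["year"].astype(int)

# Assumed summary: average rank over the available years (lower is better).
mean_rank = long.groupby("country")["rank"].mean()
print(mean_rank)
```

Teams with no rankings in the selected years, such as Zaire here, drop out of the summary entirely, so the year range used should be reported alongside any such average.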


## Gross Domestic Product (GDP) per capita

In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION 

# Ethics & Privacy

- Some of the data we proposed comes directly from FIFA. We believe that getting data directly from the source of what we are analyzing makes it more reliable. We still need to check where the World Cup performance data we use from Wikipedia originates. Because Wikipedia can be edited by anyone, we may need to obtain that data from a different source.
- To limit bias and ethical concerns, we will make sure that every website and data file we bring into the project is credible. This includes checking that the data has not been altered by third parties and that its publishers are credible sources in the field we are exploring. As far as possible, we will get data directly from primary sources, such as the FIFA World Cup website and official government websites for corruption statistics. We will also use multiple sources and cross-reference them to make sure the data aligns. When analyzing the data, we will state any limitations we encountered while collecting it.

# Team Expectations 

- Primary mode of communication will be through text message
- Everyone is responsible for completing their assigned tasks
- If a team member doesn’t think they will get their task done on time, or if there is some sort of conflict, they will communicate with the rest of the team as soon as possible
- Everyone should attend all meetings unless they have communicated otherwise beforehand
- Everyone should use credible sources and not make up any information used in the project.
- Everyone should follow the timeline to the best of their ability so we don't fall behind

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/29  |  10 AM  | Brainstorm project ideas for project proposal  | Finalize project topic (corruption & football); assign roles for project proposal | 
| 4/30  |  11 AM  |  Background research; gather examples of corruption in football & policy | Discuss hypothesis, ethical concerns, and ideal dataset structure; submit proposal | 
| 5/5  | 2 PM  | Begin data wrangling (CPI, PFI, FIFA, World Cup, GDP); clean & align formats  | Review merged data; draft EDA visuals; plan modeling approach   |
| 5/14  | 10 AM  | Finish wrangling | Review wrangling; submit checkpoint; discuss EDA further   |
| 5/27  | 12 PM  | Work on data analysis | Review analysis; submit checkpoint |
| 6/8  | 12 PM  | Draft results, conclusions, and discussion sections | Review sections; draft video and record |
| 6/13  | Before 11:59 PM  | Final edits and proofreading | Submit final project and video, complete group project surveys and post-course survey |