In [82]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import pandas as pd
import numpy as np

## Description

Kickstarter is a service where anyone can try to get funding from private individuals and companies to realize a project. The data consists of a few thousand published projects. If you are unsure about what Kickstarter is, google it! :-)

The task is to use the Python library "pandas" to work with the data. This lab assumes that you have studied pandas and taken part in the lessons, so that you can work out the solutions yourself.

### Use Jupyter Notebook in VS Code (Option 1), Recommended
Solve the notebook in VS Code. You will need to install pandas using pip or conda, for example "pip install pandas". This YouTube video walks through the setup: https://www.youtube.com/watch?v=jNk-ZmeIz6c&ab_channel=VeryAcademy

**NOTE:** If you include a virtual environment (venv), you must either name it "venv" or change the name in the .gitignore file to match the name of your virtual environment. In other words, you should not push your virtual environment to git! In many cases you do not need a virtual environment; you can run "pip install pandas" directly. If the wrong Python version is picked up, change the interpreter: press Ctrl + Shift + P, choose "Python: Select Interpreter", and try your installed Python versions.

If you cannot run your cells, make sure you have selected a "kernel" at the top right.

### Use Google Colab for your Notebook (Option 2), Not Recommended
You can complete the task using Google Colab. You will need to upload main.ipynb and then add the test data "kickstarter.csv". Keep in mind that at least 2-3 times during the process you should save the file, make a commit in git, and push it. Then you continue. This is the disadvantage of Colab: we cannot work with git as smoothly.

### How to work in the Notebook
- **It must be possible to run the code from start to finish and get the same result again.**
- You should describe the code in each cell so that I can see that you understand what you are doing.
- Use the documentation as support: [https://pandas.pydata.org/docs/reference/frame.html](https://pandas.pydata.org/docs/reference/frame.html)

## Section 1

### Load the dataset kickstarter.csv into a dataframe, name it "df"

In [83]:
df = pd.read_csv('kickstarter.csv') # I loaded the dataset with pd.read_csv() and assigned the resulting DataFrame to the name 'df'.
df                                  # Then I evaluated df as the last expression of the cell so the notebook displays it

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
0,1054480724ID,Luminoodle COLOR & BASECAMP: Versatile & Porta...,Product Design,Design,USD,2016-04-22,25000.0,2016-03-22 16:39:57,408052.0,successful,5085,US,98849$ usd,408052.00,25000.00,"Mary Grove, 30815"
1,1317034686ID,Liberty's work was accepted into the Library o...,Shorts,Film & Video,USD,2017-02-23,1100.0,2017-02-03 07:26:57,1106.0,successful,10,US,52486$ usd,1106.00,1100.00,"Garcia Tunnel, 77525"
2,121816660ID,Dusk - Open World Survival Horror (Canceled),Video Games,Games,GBP,2015-12-19,50000.0,2015-11-18 01:53:10,1625.0,canceled,58,GB,53482$ usd,2422.88,74550.09,"Payne Isle, 42813"
3,49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,0.0,failed,0,SE,64632$ usd,0.00,23054.76,"Laura Stream, 08932"
4,1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.0,successful,9,US,99272$ usd,5025.00,5000.00,"Andrew Pike, 24782"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,867650362ID,SideWayz Theatre's First Production Fund,Theater,Theater,USD,2016-03-23,2500.0,2016-03-03 23:30:22,25.0,failed,1,US,70106$ usd,25.00,2500.00,"Fernandez Cliff, 07444"
1996,1512412268ID,THE LOVE & MISERY OF A MOONSHINE MIND Project,Publishing,Publishing,USD,2012-10-11,4850.0,2012-09-14 23:43:32,40.0,failed,2,US,77334$ usd,40.00,4850.00,"Williams Junction, 45598"
1997,508267839ID,Mall-World Animated Series - Comedy,Animation,Film & Video,USD,2016-12-19,25000.0,2016-11-04 00:45:31,86.0,failed,2,US,95565$ usd,86.00,25000.00,"Elaine Point, 97758"
1998,945357371ID,Behind the Food Carts Road Trip,Photography,Photography,USD,2014-02-04,5000.0,2014-01-14 23:48:24,5610.0,successful,77,US,86498$ usd,5610.00,5000.00,"Nancy Dam, 25971"


### Print the first 5 rows

In [84]:
df.head() # We refer to the DataFrame by its name, in this case 'df', and .head() returns its first rows.
          # The parameter n defaults to 5, so empty parentheses give exactly 5 rows.
          # For any other number of rows we pass it explicitly, e.g. df.head(10)


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
0,1054480724ID,Luminoodle COLOR & BASECAMP: Versatile & Porta...,Product Design,Design,USD,2016-04-22,25000.0,2016-03-22 16:39:57,408052.0,successful,5085,US,98849$ usd,408052.0,25000.0,"Mary Grove, 30815"
1,1317034686ID,Liberty's work was accepted into the Library o...,Shorts,Film & Video,USD,2017-02-23,1100.0,2017-02-03 07:26:57,1106.0,successful,10,US,52486$ usd,1106.0,1100.0,"Garcia Tunnel, 77525"
2,121816660ID,Dusk - Open World Survival Horror (Canceled),Video Games,Games,GBP,2015-12-19,50000.0,2015-11-18 01:53:10,1625.0,canceled,58,GB,53482$ usd,2422.88,74550.09,"Payne Isle, 42813"
3,49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,0.0,failed,0,SE,64632$ usd,0.0,23054.76,"Laura Stream, 08932"
4,1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.0,successful,9,US,99272$ usd,5025.0,5000.0,"Andrew Pike, 24782"
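
As a small illustration of that default, the sketch below builds a hypothetical eight-row frame (`demo` and its `value` column are made up here and are not part of the lab data) and compares `head()` with an explicit count.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate head(); not the Kickstarter file.
demo = pd.DataFrame({"value": range(8)})

default_rows = demo.head()     # n defaults to 5
explicit_rows = demo.head(5)   # the same call with the default written out
three_rows = demo.head(3)      # any other count has to be passed explicitly

print(len(default_rows), len(explicit_rows), len(three_rows))
```

Both of the first two calls return the same five rows; only the third changes the count.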


### Print 20 random rows from the dataset

In [85]:
df.sample(20) # .sample() returns randomly chosen rows;
              # the number of rows we want goes inside the parentheses.
              # Without a random_state argument the selection changes on every run


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
1797,300854708ID,The Summer's Cypher & Documentary Shoot: OKC,Documentary,Film & Video,USD,2017-05-01,6000.0,2017-04-01 15:12:43,90.0,canceled,2,US,98343$ usd,90.0,6000.0,"Leslie Fords, 71279"
1281,483438923ID,WAVERYDER,Video Games,Games,GBP,2013-06-23,6600.0,2013-05-09 06:57:37,46.0,failed,5,GB,68268$ usd,70.19,10070.95,"Lyons Ports, 73691"
961,801110890ID,Waiting for Lefty - LA to NY,Theater,Theater,USD,2012-02-10,10000.0,2011-12-12 23:03:25,10764.4,successful,203,US,84096$ usd,10764.4,10000.0,"Dillon Landing, 46323"
776,1421990050ID,Juárez Siempre en el Alma / My Soul always wit...,Art Books,Publishing,USD,2012-07-09,10000.0,2012-06-18 18:47:22,0.0,failed,0,US,76139$ usd,0.0,10000.0,"Barbara Corner, 72515"
1634,1751639331ID,Meet kubo: an electric scooter to carry all yo...,Hardware,Technology,USD,2013-12-20,300000.0,2013-11-20 06:00:07,56667.0,failed,166,US,99394$ usd,56667.0,300000.0,"Davis Ville, 65551"
1783,1165360349ID,Yesterday's Tomorrow Album,Hip-Hop,Music,USD,2014-04-17,6000.0,2014-03-18 22:40:24,6478.0,successful,133,US,73578$ usd,6478.0,6000.0,"Lance Stream, 48529"
988,1600323371ID,Tomorrow: an apocalyptic nightmare,Tabletop Games,Games,USD,2013-01-08,15000.0,2012-12-07 03:43:23,66820.47,successful,758,US,86738$ usd,66820.47,15000.0,"Reed Fields, 50234"
1995,867650362ID,SideWayz Theatre's First Production Fund,Theater,Theater,USD,2016-03-23,2500.0,2016-03-03 23:30:22,25.0,failed,1,US,70106$ usd,25.0,2500.0,"Fernandez Cliff, 07444"
1170,641078073ID,Chatter Collapse Chatter,Nonfiction,Publishing,USD,2012-08-09,2000.0,2012-06-11 22:49:01,85.0,failed,4,US,87731$ usd,85.0,2000.0,"Blake Terrace, 75028"
297,753595781ID,The Ladykiller,Shorts,Film & Video,USD,2010-07-24,1234.0,2010-06-24 19:25:00,0.0,failed,0,US,76980$ usd,0.0,1234.0,"Hill Drive, 29409"
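
Because the notebook is expected to give the same result on every run, one option is to fix the seed of the random draw. The sketch below is only an illustration on a hypothetical frame `demo`; the seed value 42 is an arbitrary choice, not something the lab prescribes.

```python
import pandas as pd

# Hypothetical toy data; the Kickstarter frame would be used the same way.
demo = pd.DataFrame({"value": range(100)})

# The same random_state yields the same 20 rows each time the cell is run.
first_draw = demo.sample(20, random_state=42)
second_draw = demo.sample(20, random_state=42)

print(first_draw.index.equals(second_draw.index))
```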


### Display a summary of information from your dataframe

In [86]:
df.info() # With .info() we can see how many rows and columns the DataFrame has, 
          # the index numbers, the column names, whether they contain values or are empty, 
          # and finally, the data types of each column


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  2000 non-null   object 
 1   name                2000 non-null   object 
 2   category            2000 non-null   object 
 3   main_category       2000 non-null   object 
 4   currency            2000 non-null   object 
 5   deadline            2000 non-null   object 
 6   goal                2000 non-null   float64
 7   launched            2000 non-null   object 
 8   pledged             2000 non-null   float64
 9   state               2000 non-null   object 
 10  backers             2000 non-null   int64  
 11  country             2000 non-null   object 
 12  usd pledged         2000 non-null   object 
 13  usd_pledged_result  2000 non-null   float64
 14  usd_funding_goal    2000 non-null   float64
 15  company_address     2000 non-null   object 
dtypes: float64(4), int64(1), object(11)

### Print row 5 to row 10 (including 10) using .iloc

In [87]:
df.iloc[5:11] # .iloc selects rows by integer position.
              # 5 is the first position we want to see, and 10 is the last.
              # The end of a positional slice is exclusive, so we write 11 to include row 10


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
5,2021765829ID,Feels Like Home is Going On Tour,Rock,Music,USD,2013-05-01,5000.0,2013-04-10 01:42:03,5490.69,successful,93,US,74996$ usd,5490.69,5000.0,"Gonzalez Landing, 20367"
6,923683665ID,Sarah’s Journey: Empowering the Dreams of Disp...,Children's Books,Publishing,SEK,2016-12-19,218000.0,2016-11-15 19:13:46,227101.0,successful,352,SE,67246$ usd,24593.47,23607.89,"Hall Loaf, 09759"
7,397532230ID,James Clark's Lonely Hearts Rugby Clwb Film & ...,Music,Music,GBP,2014-02-04,300.0,2014-01-07 23:34:41,624.0,successful,26,GB,61172$ usd,1018.79,489.8,"Cunningham Throughway, 66540"
8,181163322ID,Chosen Kin: A New Superhero Series & Interacti...,Webseries,Film & Video,USD,2017-02-02,5000.0,2017-01-03 06:54:08,156.0,failed,5,US,96858$ usd,156.0,5000.0,"Bullock Skyway, 28207"
9,1616507003ID,Adventures in Oz: Beyond the Deadly Desert,Games,Games,USD,2011-12-02,3000.0,2011-11-01 00:00:10,520.0,failed,17,US,82928$ usd,520.0,3000.0,"Harris Estates, 48703"
10,373344194ID,Moonlight Vulture,Restaurants,Food,USD,2016-01-04,10000.0,2015-12-14 15:23:00,10727.0,successful,102,US,51152$ usd,10727.0,10000.0,"Bryant Manor, 16806"
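
The end-exclusive rule is the same one that plain Python lists follow. A minimal sketch on hypothetical data (`letters` and `demo` are illustrative names only):

```python
import pandas as pd

# Hypothetical toy data with a default 0..11 integer index.
letters = list("abcdefghijkl")
demo = pd.DataFrame({"value": letters})

rows = demo.iloc[5:11]      # positions 5, 6, 7, 8, 9, 10
list_slice = letters[5:11]  # a list slice stops before index 11 as well

print(rows.index.tolist())
print(list_slice)
```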


### Change the index (row names) so that the index is based on the existing column ID; the change should be permanent (do not use inplace)

In [88]:
df = df.set_index('ID')   # With the set_index() method, we change the index to 'ID'. 
df                        # By assigning it back to df with '=', the change becomes permanent

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1054480724ID,Luminoodle COLOR & BASECAMP: Versatile & Porta...,Product Design,Design,USD,2016-04-22,25000.0,2016-03-22 16:39:57,408052.0,successful,5085,US,98849$ usd,408052.00,25000.00,"Mary Grove, 30815"
1317034686ID,Liberty's work was accepted into the Library o...,Shorts,Film & Video,USD,2017-02-23,1100.0,2017-02-03 07:26:57,1106.0,successful,10,US,52486$ usd,1106.00,1100.00,"Garcia Tunnel, 77525"
121816660ID,Dusk - Open World Survival Horror (Canceled),Video Games,Games,GBP,2015-12-19,50000.0,2015-11-18 01:53:10,1625.0,canceled,58,GB,53482$ usd,2422.88,74550.09,"Payne Isle, 42813"
49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,0.0,failed,0,SE,64632$ usd,0.00,23054.76,"Laura Stream, 08932"
1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.0,successful,9,US,99272$ usd,5025.00,5000.00,"Andrew Pike, 24782"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
867650362ID,SideWayz Theatre's First Production Fund,Theater,Theater,USD,2016-03-23,2500.0,2016-03-03 23:30:22,25.0,failed,1,US,70106$ usd,25.00,2500.00,"Fernandez Cliff, 07444"
1512412268ID,THE LOVE & MISERY OF A MOONSHINE MIND Project,Publishing,Publishing,USD,2012-10-11,4850.0,2012-09-14 23:43:32,40.0,failed,2,US,77334$ usd,40.00,4850.00,"Williams Junction, 45598"
508267839ID,Mall-World Animated Series - Comedy,Animation,Film & Video,USD,2016-12-19,25000.0,2016-11-04 00:45:31,86.0,failed,2,US,95565$ usd,86.00,25000.00,"Elaine Point, 97758"
945357371ID,Behind the Food Carts Road Trip,Photography,Photography,USD,2014-02-04,5000.0,2014-01-14 23:48:24,5610.0,successful,77,US,86498$ usd,5610.00,5000.00,"Nancy Dam, 25971"
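
The reason the assignment is needed is that `set_index()` returns a new DataFrame and leaves the original untouched. A short sketch with made-up IDs (the frame `demo` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data with an ID column similar in shape to the lab file.
demo = pd.DataFrame({"ID": ["a1ID", "b2ID", "c3ID"], "goal": [100.0, 250.0, 75.0]})

indexed = demo.set_index("ID")    # new frame with ID as the index

print(demo.columns.tolist())      # 'ID' is still an ordinary column here
print(indexed.columns.tolist())   # 'ID' has moved into the index
print(indexed.index.name)
```

Writing `demo = demo.set_index("ID")` would then rebind the name to the new frame, which is what makes the change last without `inplace`.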


### Display the rows with indices "121816660ID" to "1503277376ID" using .loc
- Explain why a dataframe is in many ways a mixture of a list and a dictionary; use .iloc and .loc when making your argument.

In [89]:
df.loc['121816660ID':'1503277376ID'] # Selected rows from a start label to an end label using .loc.
                                     # Since the index now consists of the ID strings, we can use the actual labels
                                     # instead of counting positions. Unlike .iloc, a .loc slice includes the end label.
# A DataFrame is like a list and a dictionary:
# .iloc accesses rows by position (like indexing a list), and .loc accesses rows by index label (like looking up a key in a dictionary)


Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
121816660ID,Dusk - Open World Survival Horror (Canceled),Video Games,Games,GBP,2015-12-19,50000.0,2015-11-18 01:53:10,1625.0,canceled,58,GB,53482$ usd,2422.88,74550.09,"Payne Isle, 42813"
49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,0.0,failed,0,SE,64632$ usd,0.0,23054.76,"Laura Stream, 08932"
1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.0,successful,9,US,99272$ usd,5025.0,5000.0,"Andrew Pike, 24782"
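
To make the list/dictionary comparison concrete, here is a minimal sketch on a hypothetical four-row frame with string labels (all names and values are invented for illustration). It selects the same two rows once by position and once by label.

```python
import pandas as pd

# Hypothetical toy data indexed by made-up ID strings.
demo = pd.DataFrame(
    {"goal": [100.0, 250.0, 75.0, 900.0]},
    index=["a1ID", "b2ID", "c3ID", "d4ID"],
)

by_position = demo.iloc[1:3]         # list-like: positions 1 and 2, end excluded
by_label = demo.loc["b2ID":"c3ID"]   # label slice: both end labels included
single = demo.loc["c3ID", "goal"]    # dictionary-like lookup by key

print(by_position.equals(by_label))
print(single)
```

The positional slice needs 3 as its end to reach the row at position 2, while the label slice names the last row it wants directly.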


### Display all rows with the currency "USD" by using a mask

In [90]:
mask = (df['currency']=='USD')   # Created a mask to select rows where the 'currency' is 'USD' and displayed only those rows
df[mask]

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1054480724ID,Luminoodle COLOR & BASECAMP: Versatile & Porta...,Product Design,Design,USD,2016-04-22,25000.0,2016-03-22 16:39:57,408052.00,successful,5085,US,98849$ usd,408052.00,25000.0,"Mary Grove, 30815"
1317034686ID,Liberty's work was accepted into the Library o...,Shorts,Film & Video,USD,2017-02-23,1100.0,2017-02-03 07:26:57,1106.00,successful,10,US,52486$ usd,1106.00,1100.0,"Garcia Tunnel, 77525"
1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.00,successful,9,US,99272$ usd,5025.00,5000.0,"Andrew Pike, 24782"
2021765829ID,Feels Like Home is Going On Tour,Rock,Music,USD,2013-05-01,5000.0,2013-04-10 01:42:03,5490.69,successful,93,US,74996$ usd,5490.69,5000.0,"Gonzalez Landing, 20367"
181163322ID,Chosen Kin: A New Superhero Series & Interacti...,Webseries,Film & Video,USD,2017-02-02,5000.0,2017-01-03 06:54:08,156.00,failed,5,US,96858$ usd,156.00,5000.0,"Bullock Skyway, 28207"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
867650362ID,SideWayz Theatre's First Production Fund,Theater,Theater,USD,2016-03-23,2500.0,2016-03-03 23:30:22,25.00,failed,1,US,70106$ usd,25.00,2500.0,"Fernandez Cliff, 07444"
1512412268ID,THE LOVE & MISERY OF A MOONSHINE MIND Project,Publishing,Publishing,USD,2012-10-11,4850.0,2012-09-14 23:43:32,40.00,failed,2,US,77334$ usd,40.00,4850.0,"Williams Junction, 45598"
508267839ID,Mall-World Animated Series - Comedy,Animation,Film & Video,USD,2016-12-19,25000.0,2016-11-04 00:45:31,86.00,failed,2,US,95565$ usd,86.00,25000.0,"Elaine Point, 97758"
945357371ID,Behind the Food Carts Road Trip,Photography,Photography,USD,2014-02-04,5000.0,2014-01-14 23:48:24,5610.00,successful,77,US,86498$ usd,5610.00,5000.0,"Nancy Dam, 25971"
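
A mask is simply a boolean Series with one True/False value per row. The sketch below shows this on hypothetical data (`demo` is not the lab frame) so that the intermediate mask can be inspected.

```python
import pandas as pd

# Hypothetical toy data with a currency column.
demo = pd.DataFrame(
    {"currency": ["USD", "GBP", "USD", "SEK"], "goal": [100.0, 250.0, 75.0, 900.0]}
)

mask = demo["currency"] == "USD"   # one boolean per row
usd_rows = demo[mask]              # keeps only the rows where the mask is True

print(mask.tolist())
print(usd_rows)
```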


### The "pledged" column has quite a few values with 0.0. Change these to the NaN data type instead, using .loc for example.

In [91]:
df['pledged'] = df['pledged'].replace(0.0, np.nan)   # I selected the 'pledged' column and replaced all 0.0 values with NaN
df                                                   # By assigning the result back to df['pledged'], I made the change permanent in the DataFrame



Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
1054480724ID,Luminoodle COLOR & BASECAMP: Versatile & Porta...,Product Design,Design,USD,2016-04-22,25000.0,2016-03-22 16:39:57,408052.0,successful,5085,US,98849$ usd,408052.00,25000.00,"Mary Grove, 30815"
1317034686ID,Liberty's work was accepted into the Library o...,Shorts,Film & Video,USD,2017-02-23,1100.0,2017-02-03 07:26:57,1106.0,successful,10,US,52486$ usd,1106.00,1100.00,"Garcia Tunnel, 77525"
121816660ID,Dusk - Open World Survival Horror (Canceled),Video Games,Games,GBP,2015-12-19,50000.0,2015-11-18 01:53:10,1625.0,canceled,58,GB,53482$ usd,2422.88,74550.09,"Payne Isle, 42813"
49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,,failed,0,SE,64632$ usd,0.00,23054.76,"Laura Stream, 08932"
1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.0,successful,9,US,99272$ usd,5025.00,5000.00,"Andrew Pike, 24782"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
867650362ID,SideWayz Theatre's First Production Fund,Theater,Theater,USD,2016-03-23,2500.0,2016-03-03 23:30:22,25.0,failed,1,US,70106$ usd,25.00,2500.00,"Fernandez Cliff, 07444"
1512412268ID,THE LOVE & MISERY OF A MOONSHINE MIND Project,Publishing,Publishing,USD,2012-10-11,4850.0,2012-09-14 23:43:32,40.0,failed,2,US,77334$ usd,40.00,4850.00,"Williams Junction, 45598"
508267839ID,Mall-World Animated Series - Comedy,Animation,Film & Video,USD,2016-12-19,25000.0,2016-11-04 00:45:31,86.0,failed,2,US,95565$ usd,86.00,25000.00,"Elaine Point, 97758"
945357371ID,Behind the Food Carts Road Trip,Photography,Photography,USD,2014-02-04,5000.0,2014-01-14 23:48:24,5610.0,successful,77,US,86498$ usd,5610.00,5000.00,"Nancy Dam, 25971"
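
The task text suggests `.loc`; the cell above uses `.replace()`, which gives the same result. For comparison, a sketch of the `.loc` variant on hypothetical data (the frame `demo` and its values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two zero pledges.
demo = pd.DataFrame({"pledged": [408052.0, 0.0, 5025.0, 0.0]})

# .loc takes a row selector (here a boolean mask) and a column label,
# and assigns NaN only to the matching cells.
demo.loc[demo["pledged"] == 0.0, "pledged"] = np.nan

print(demo)
```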


### Print all rows that have NaN with their actual values (i.e., it should not just be True / False). You must use a specific method/function.

In [92]:
df[df.isna().any(axis=1)] # df.isna() marks every NaN cell as True, and .any(axis=1) reduces that to one True/False per row.
                          # Using that boolean Series inside df[] selects only the rows that contain at least one NaN;
                          # without the outer df[] we would only see True and False


Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,,failed,0,SE,64632$ usd,0.0,23054.76,"Laura Stream, 08932"
1384707351ID,Batman Band Builder for Ios and Android (Cance...,Games,Games,GBP,2015-10-28,500.0,2015-09-28 21:17:03,,canceled,0,GB,50983$ usd,0.0,766.99,"Barbara Islands, 71175"
1511392096ID,Criss Cross Quail acres,Farms,Food,USD,2015-02-20,30000.0,2015-01-21 21:23:08,,failed,0,US,80915$ usd,0.0,30000.00,"Stacey Loaf, 25341"
1616831393ID,Help us construct with music (Canceled),Music,Music,USD,2011-04-09,10000.0,2011-03-10 05:20:04,,canceled,0,US,86018$ usd,0.0,10000.00,"Jill Course, 75981"
1992107184ID,Community Art Wall (Canceled),Public Art,Art,USD,2013-07-28,6750.0,2013-05-29 22:00:17,,canceled,0,US,85307$ usd,0.0,6750.00,"Collins Pike, 12819"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
625585267ID,Rosie the hippo: Do I have to share my toys? (...,Children's Books,Publishing,USD,2015-12-16,3000.0,2015-11-16 18:31:08,,canceled,0,US,61783$ usd,0.0,3000.00,"Jennifer Trail, 37296"
615895406ID,Kombat Komedian 2,Experimental,Film & Video,EUR,2015-08-15,2000.0,2015-06-18 01:01:07,,failed,0,IT,63965$ usd,0.0,2256.19,"Harmon Walk, 31405"
603589651ID,The beginning,Animation,Film & Video,DKK,2017-09-26,35000.0,2017-08-27 23:22:06,,failed,0,DK,77101$ usd,0.0,5552.21,"Shaw Shore, 91720"
374290795ID,Soul Sessions with Yolanda Stephens (Canceled),Music,Music,USD,2015-09-30,50000.0,2015-08-18 21:59:21,,canceled,0,US,71133$ usd,0.0,50000.00,"Fitzpatrick Alley, 18274"
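
The three steps in that one-liner can be separated to see what each one returns. This is a sketch on hypothetical data; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a single missing value.
demo = pd.DataFrame({"pledged": [10.0, np.nan, 30.0], "backers": [1, 0, 3]})

cell_flags = demo.isna()             # DataFrame of booleans, one per cell
row_flags = cell_flags.any(axis=1)   # Series of booleans, one per row
rows_with_nan = demo[row_flags]      # the actual rows, with their values

print(row_flags.tolist())
print(rows_with_nan)
```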


### Create a new column called pledged_real_diff that represents the difference in USD between "usd pledged" and usd_pledged_result.
Hint: You will need to "clean" the "usd pledged" column so that you can perform mathematical operations on it, by first removing the unnecessary text and then converting it to a float.

In [93]:
df['usd pledged'].str.len().unique()                                    # I checked how many characters each cell in 'usd pledged' has
df[~df['usd pledged'].str.lower().str.endswith('$ usd')]                # I checked if all values end with '$ usd'
df['usd pledged'] =  df['usd pledged'].str[:-5].astype(float)           # I removed the last 5 characters ('$ usd') and converted the column to float, so I can subtract numbers
df['usd pledged']                                                       # I checked if the change worked
df['pledged_real_diff'] = df['usd pledged'] - df['usd_pledged_result']  # I created a new column by subtracting 'usd_pledged_result' from 'usd pledged'
df                                                                      # I checked the final DataFrame

array([10])

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1


ID
1054480724ID    98849.0
1317034686ID    52486.0
121816660ID     53482.0
49882738ID      64632.0
1503277376ID    99272.0
                 ...   
867650362ID     70106.0
1512412268ID    77334.0
508267839ID     95565.0
945357371ID     86498.0
1093469237ID    72847.0
Name: usd pledged, Length: 2000, dtype: float64

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address,pledged_real_diff
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1054480724ID,Luminoodle COLOR & BASECAMP: Versatile & Porta...,Product Design,Design,USD,2016-04-22,25000.0,2016-03-22 16:39:57,408052.0,successful,5085,US,98849.0,408052.00,25000.00,"Mary Grove, 30815",-309203.00
1317034686ID,Liberty's work was accepted into the Library o...,Shorts,Film & Video,USD,2017-02-23,1100.0,2017-02-03 07:26:57,1106.0,successful,10,US,52486.0,1106.00,1100.00,"Garcia Tunnel, 77525",51380.00
121816660ID,Dusk - Open World Survival Horror (Canceled),Video Games,Games,GBP,2015-12-19,50000.0,2015-11-18 01:53:10,1625.0,canceled,58,GB,53482.0,2422.88,74550.09,"Payne Isle, 42813",51059.12
49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,,failed,0,SE,64632.0,0.00,23054.76,"Laura Stream, 08932",64632.00
1503277376ID,FLARE Video Project,Shorts,Film & Video,USD,2011-03-12,5000.0,2011-02-10 20:32:26,5025.0,successful,9,US,99272.0,5025.00,5000.00,"Andrew Pike, 24782",94247.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
867650362ID,SideWayz Theatre's First Production Fund,Theater,Theater,USD,2016-03-23,2500.0,2016-03-03 23:30:22,25.0,failed,1,US,70106.0,25.00,2500.00,"Fernandez Cliff, 07444",70081.00
1512412268ID,THE LOVE & MISERY OF A MOONSHINE MIND Project,Publishing,Publishing,USD,2012-10-11,4850.0,2012-09-14 23:43:32,40.0,failed,2,US,77334.0,40.00,4850.00,"Williams Junction, 45598",77294.00
508267839ID,Mall-World Animated Series - Comedy,Animation,Film & Video,USD,2016-12-19,25000.0,2016-11-04 00:45:31,86.0,failed,2,US,95565.0,86.00,25000.00,"Elaine Point, 97758",95479.00
945357371ID,Behind the Food Carts Road Trip,Photography,Photography,USD,2014-02-04,5000.0,2014-01-14 23:48:24,5610.0,successful,77,US,86498.0,5610.00,5000.00,"Nancy Dam, 25971",80888.00
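
Slicing off the last five characters works here because every value was checked to end with '$ usd'. An alternative that does not depend on the suffix length is to remove the literal text and then convert. The sketch below assumes the same value format and uses a hypothetical two-row frame built from the first two rows shown above.

```python
import pandas as pd

# Hypothetical two-row frame in the same format as the lab columns.
demo = pd.DataFrame(
    {
        "usd pledged": ["98849$ usd", "52486$ usd"],
        "usd_pledged_result": [408052.0, 1106.0],
    }
)

# regex=False treats "$ usd" as plain text; otherwise "$" would be a regex anchor.
cleaned = demo["usd pledged"].str.replace("$ usd", "", regex=False)
demo["usd pledged"] = cleaned.astype(float)

demo["pledged_real_diff"] = demo["usd pledged"] - demo["usd_pledged_result"]
print(demo)
```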


### Display how many Kickstarter projects there are broken down by category using the main_category column

In [94]:
df['main_category'].value_counts() # I counted how many times each value appears in the column to see the distribution of categories

main_category
Film & Video    375
Music           272
Publishing      198
Games           186
Art             154
Design          153
Technology      150
Food            142
Fashion         102
Photography      62
Theater          57
Crafts           56
Comics           47
Journalism       23
Dance            23
Name: count, dtype: int64

### What is the highest result for usd_pledged_result? You must use a method.

In [95]:
df['usd_pledged_result'].max() # I checked the maximum value in the column to see the highest amount pledged

np.float64(1021481.0)

### Show the projects with the lowest "usd_pledged_result" - use .min()
You should not only show the value, but the entire rows.
Tip: Use a mask!

In [96]:
mask = df['usd_pledged_result'] == df['usd_pledged_result'].min()  # I created a mask to find the rows with the smallest value,
df[mask]                                                           # then used it to display those rows from the DataFrame

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address,pledged_real_diff
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
49882738ID,Eatsleeptrainrepeat,Web,Technology,SEK,2014-12-24,180000.0,2014-11-24 21:47:27,,failed,0,SE,64632.0,0.0,23054.76,"Laura Stream, 08932",64632.0
1384707351ID,Batman Band Builder for Ios and Android (Cance...,Games,Games,GBP,2015-10-28,500.0,2015-09-28 21:17:03,,canceled,0,GB,50983.0,0.0,766.99,"Barbara Islands, 71175",50983.0
1511392096ID,Criss Cross Quail acres,Farms,Food,USD,2015-02-20,30000.0,2015-01-21 21:23:08,,failed,0,US,80915.0,0.0,30000.00,"Stacey Loaf, 25341",80915.0
1616831393ID,Help us construct with music (Canceled),Music,Music,USD,2011-04-09,10000.0,2011-03-10 05:20:04,,canceled,0,US,86018.0,0.0,10000.00,"Jill Course, 75981",86018.0
1992107184ID,Community Art Wall (Canceled),Public Art,Art,USD,2013-07-28,6750.0,2013-05-29 22:00:17,,canceled,0,US,85307.0,0.0,6750.00,"Collins Pike, 12819",85307.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
625585267ID,Rosie the hippo: Do I have to share my toys? (...,Children's Books,Publishing,USD,2015-12-16,3000.0,2015-11-16 18:31:08,,canceled,0,US,61783.0,0.0,3000.00,"Jennifer Trail, 37296",61783.0
615895406ID,Kombat Komedian 2,Experimental,Film & Video,EUR,2015-08-15,2000.0,2015-06-18 01:01:07,,failed,0,IT,63965.0,0.0,2256.19,"Harmon Walk, 31405",63965.0
603589651ID,The beginning,Animation,Film & Video,DKK,2017-09-26,35000.0,2017-08-27 23:22:06,,failed,0,DK,77101.0,0.0,5552.21,"Shaw Shore, 91720",77101.0
374290795ID,Soul Sessions with Yolanda Stephens (Canceled),Music,Music,USD,2015-09-30,50000.0,2015-08-18 21:59:21,,canceled,0,US,71133.0,0.0,50000.00,"Fitzpatrick Alley, 18274",71133.0


### Find all rows containing the word "Cross" in the "name" column using string methods and display the highest value for the "pledged" column.

In [97]:
df[df['name'].str.contains('Cross', case=False, na=False)].sort_values('pledged', ascending=False)
# I filtered the DataFrame to include only projects with 'Cross' in their name. With case=False the match is
# case-insensitive, so names such as 'Cyclocross' and 'lacrosse' are included as well.
# Then I sorted these rows by 'pledged' in descending order, so the first row holds the highest pledged value (19229.0)

Unnamed: 0_level_0,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_result,usd_funding_goal,company_address,pledged_real_diff
ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
1180112226ID,2015/2016 Cyclocross Album,Art Books,Publishing,GBP,2016-02-06,17500.0,2016-01-07 22:07:34,19229.0,successful,379,GB,88481.0,27890.35,25382.55,"Troy Falls, 51029",60590.65
2084671684ID,Help Cross Record make a full length album!,Indie Rock,Music,USD,2012-02-18,8500.0,2012-01-14 21:15:28,8618.5,successful,99,US,64771.0,8618.5,8500.0,"Jennifer Trail, 46231",56152.5
640879014ID,The Cross Brush by 'Iron Sharpens Iron Golf',Product Design,Design,USD,2015-04-14,20000.0,2015-02-25 05:25:01,5281.0,failed,38,US,70357.0,5281.0,20000.0,"Lisa Springs, 73213",65076.0
628240076ID,Crossroads (Short Film),Thrillers,Film & Video,USD,2017-03-06,5000.0,2017-02-04 21:00:11,5128.51,successful,78,US,55935.0,5128.51,5000.0,"Delgado Field, 88721",50806.49
343618386ID,A Moment In The Movement - A cultural lacrosse...,Documentary,Film & Video,USD,2013-07-14,45000.0,2013-06-24 21:55:50,150.0,failed,3,US,68330.0,150.0,45000.0,"Elizabeth Vista, 44228",68180.0
1511392096ID,Criss Cross Quail acres,Farms,Food,USD,2015-02-20,30000.0,2015-01-21 21:23:08,,failed,0,US,80915.0,0.0,30000.0,"Stacey Loaf, 25341",80915.0
583947260ID,I Wonder: A Cross Curricular Textbook,Publishing,Publishing,USD,2013-07-11,1200.0,2013-06-11 01:57:20,,failed,0,US,52028.0,0.0,1200.0,"Brown Turnpike, 38840",52028.0
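
The task also asks for the highest "pledged" value as such. A minimal sketch of one way to get it as a single number, using a small hypothetical DataFrame (`demo` and its values are made up for illustration and are not taken from kickstarter.csv):

```python
import pandas as pd

# Hypothetical miniature of the projects table; the values are illustrative only.
demo = pd.DataFrame(
    {
        "name": ["Cross Stitch Kit", "Lacrosse Film", "Board Game", None],
        "pledged": [120.0, 450.5, 9000.0, 30.0],
    }
)

# Case-insensitive substring match; na=False treats missing names as non-matches.
mask = demo["name"].str.contains("cross", case=False, na=False)

# Highest pledged amount among the matching rows, returned as one number.
highest = demo.loc[mask, "pledged"].max()
print(highest)  # 450.5
```

Note that a case-insensitive substring match also catches words such as "Lacrosse" or "Crossroads", which is why those projects appear in the table above.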


## Section 2

### Display how many Kickstarter projects there are per category using the main_category column. Display the top 5 sorted from highest to lowest.

In [98]:
df['main_category'].value_counts(ascending=False).head()
# value_counts() counts how many times each unique value appears in the column
# and sorts the counts from largest to smallest.
# .head() then shows only the top 5 most common categories.

main_category
Film & Video    375
Music           272
Publishing      198
Games           186
Art             154
Name: count, dtype: int64

### Display the top 10 most uncommon COMBINATIONS of "main_category" and "category" where the country is "US"

In [99]:
df[df['country'] == 'US'][['main_category', 'category']].value_counts().sort_values(ascending=True).head(10)
# I filtered the DataFrame to include only rows where country is 'US',
# then counted how many times each combination of 'main_category' and 'category' occurs,
# sorted the counts from least to most frequent,
# and displayed the top 10 rarest combinations

main_category  category     
Crafts         Knitting         1
Fashion        Ready-to-wear    1
               Footwear         1
Design         Typography       1
Crafts         Crochet          1
               Quilts           1
               Stationery       1
Dance          Workshops        1
Design         Civic Design     1
Food           Bacon            1
Name: count, dtype: int64
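
All ten combinations shown above occur exactly once, so which ten appear depends on how ties are ordered. The sketch below is one possible way to make that order reproducible, assuming a small hypothetical DataFrame (`demo`): `groupby(...).size()` returns the groups in alphabetical order, and a stable sort keeps that order among equal counts.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the notebook.
demo = pd.DataFrame(
    {
        "country": ["US", "US", "US", "GB", "US"],
        "main_category": ["Music", "Music", "Art", "Art", "Food"],
        "category": ["Rock", "Rock", "Painting", "Painting", "Bacon"],
    }
)

us = demo[demo["country"] == "US"]

# One row per (main_category, category) pair, with groups sorted alphabetically.
combo_counts = us.groupby(["main_category", "category"]).size()

# A stable sort preserves the alphabetical order among tied counts.
rarest = combo_counts.sort_values(kind="stable").head(2)
print(rarest)
```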

### Group the projects based on main_category, what is the average funding goal (usd_funding_goal) they have? Sort the result from high to low.

In [100]:
df.groupby('main_category')['usd_funding_goal'].mean().sort_values(ascending=False)
# I grouped the projects by 'main_category', calculated the average 'usd_funding_goal' for each group,
# and sorted the results from highest to lowest average funding goal

main_category
Art             666092.309286
Technology       50913.101867
Games            33791.858656
Food             31883.745070
Theater          27964.739298
Film & Video     22852.615627
Publishing       20830.815909
Design           19979.825948
Fashion          15168.148725
Music            11747.886985
Photography       9095.636613
Crafts            7852.780536
Journalism        7384.004348
Dance             5884.845217
Comics            4545.462553
Name: usd_funding_goal, dtype: float64

### Display a summary of the same grouping as the previous task. It should now contain several columns with at least mean, max, min, count
- First, use .describe() and display your output
- Use .agg() and display your output, you should only include mean, max, min, count

In [101]:
df.groupby('main_category')['usd_funding_goal'].describe()
# I grouped the projects by 'main_category' and applied .describe() to 'usd_funding_goal'.
# This displays summary statistics for each group: count, mean, standard deviation (std),
# minimum (min), quartiles (25%, 50%, 75%), and maximum (max) of the funding goals.

main_category,count,mean,std,min,25%,50%,75%,max
Art,154.0,666092.309286,8057100.0,75.0,1500.0,3491.06,10000.0,100000000.0
Comics,47.0,4545.462553,4111.874,99.0,1500.0,3500.0,6000.0,22240.34
Crafts,56.0,7852.780536,14391.11,20.0,706.16,2750.0,8000.0,69629.8
Dance,23.0,5884.845217,6328.256,350.0,2100.0,4000.0,6529.495,24958.4
Design,153.0,19979.825948,31125.54,1.0,3300.0,10000.0,22500.0,228060.6
Fashion,102.0,15168.148725,50881.61,158.18,1500.0,5000.0,12818.4025,500000.0
Film & Video,375.0,22852.615627,58320.66,15.0,2244.9,6000.0,20000.0,750000.0
Food,142.0,31883.74507,66012.81,0.89,3550.0,10000.0,33200.4075,500000.0
Games,186.0,33791.858656,97878.67,15.0,2716.6325,9000.0,20821.8175,792376.3
Journalism,23.0,7384.004348,7684.875,500.0,1857.015,4865.0,9250.0,27000.0
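
In the table above, the mean for Art (about 666,092) is far above its median (3,491.06) because the group contains a goal of 100,000,000. A minimal sketch on hypothetical data (`demo`, values made up) of how a single extreme value shifts the mean while the median barely moves:

```python
import pandas as pd

# Hypothetical data: one extreme goal in the "Art" group.
demo = pd.DataFrame(
    {
        "main_category": ["Art", "Art", "Art", "Music", "Music", "Music"],
        "usd_funding_goal": [1000.0, 3000.0, 1_000_000.0, 4000.0, 5000.0, 6000.0],
    }
)

# Compare mean and median per group side by side.
summary = demo.groupby("main_category")["usd_funding_goal"].agg(["mean", "median"])
print(summary.sort_values("median", ascending=False))
```

Here "Art" has the highest mean but "Music" has the highest median, so the ranking depends on which statistic is used.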


In [102]:
df.groupby('main_category')['usd_funding_goal'].agg(['mean', 'max', 'min', 'count'])
# Grouped the projects by 'main_category' and used .agg() to apply multiple aggregation functions
# at once on 'usd_funding_goal'. It calculates only the mean, max, min, and count for each group,
# unlike .describe(), which returns additional statistics like std and quartiles.

main_category,mean,max,min,count
Art,666092.309286,100000000.0,75.0,154
Comics,4545.462553,22240.34,99.0,47
Crafts,7852.780536,69629.8,20.0,56
Dance,5884.845217,24958.4,350.0,23
Design,19979.825948,228060.6,1.0,153
Fashion,15168.148725,500000.0,158.18,102
Film & Video,22852.615627,750000.0,15.0,375
Food,31883.74507,500000.0,0.89,142
Games,33791.858656,792376.3,15.0,186
Journalism,7384.004348,27000.0,500.0,23
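
As a variation on the `.agg()` call above, pandas also supports named aggregation, which lets the output columns carry descriptive names. This is only a sketch on hypothetical data; the names `demo`, `mean_goal`, `max_goal`, `min_goal`, and `n_projects` are illustrative choices, not part of the assignment.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the notebook.
demo = pd.DataFrame(
    {
        "main_category": ["Art", "Art", "Music", "Music", "Music"],
        "usd_funding_goal": [100.0, 300.0, 50.0, 150.0, 400.0],
    }
)

# Named aggregation: output_name=(input_column, aggregation_function).
summary = demo.groupby("main_category").agg(
    mean_goal=("usd_funding_goal", "mean"),
    max_goal=("usd_funding_goal", "max"),
    min_goal=("usd_funding_goal", "min"),
    n_projects=("usd_funding_goal", "count"),
)
print(summary)
```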


### Group based on the "state" column and display the average number of backers (i.e., the number of people who paid money to support the project). What conclusion do you draw? Max 2 sentences is enough.

In [103]:
df.groupby('state')['backers'].mean()
# Grouped the projects by 'state' and calculated the average number of backers for each state.
# Conclusion: successful projects average about 199 backers, while every other state averages fewer than 35.
# A large number of backers therefore goes hand in hand with reaching the funding goal.

state
canceled       22.548544
failed         17.244782
live           29.142857
successful    198.752890
suspended      34.555556
undefined       0.000000
Name: backers, dtype: float64

### Use .apply() 
1. Investigate which categories exist in the "state" column.
2. Create a function that, based on the values in the "state" column, modifies the "usd_funding_goal" column with a * if it says "successful", a ! for "failed" or "canceled", and --- for "suspended".
3. Display a sample of 20 rows from your modified column.

In [104]:
df['state'].apply(str).unique()   # Converted all 'state' values to strings and displayed the unique project states in the DataFrame

array(['successful', 'canceled', 'failed', 'live', 'undefined',
       'suspended'], dtype=object)

In [105]:
# Defined a function that appends a marker to the funding goal based on the project state:
# ' *' for successful, ' !' for failed or canceled, ' ---' for suspended; other states stay unmarked.
def modify_goal(state, goal):
    if state == 'successful':
        return str(goal) + ' *'                  # Converted the numeric goal to a string so the symbol can be appended.
    elif state in ['failed', 'canceled']:
        return str(goal) + ' !'
    elif state == 'suspended':
        return str(goal) + ' ---'
    else:
        return str(goal)

# Created a new column by applying this function row by row (axis=1).
df['modified'] = df.apply(lambda row: modify_goal(row['state'], row['usd_funding_goal']), axis=1)

In [106]:
df[['state','usd_funding_goal', 'modified']].sample(20) 
# Displayed a random sample of 20 rows showing the 'state','usd_funding_goal', and the 'modified' column

ID,state,usd_funding_goal,modified
2042882369ID,successful,50.0,50.0 *
317569648ID,successful,15000.0,15000.0 *
494969932ID,successful,1500.0,1500.0 *
163995856ID,failed,50000.0,50000.0 !
472003649ID,failed,150000.0,150000.0 !
1653555759ID,successful,350.0,350.0 *
1808481958ID,successful,8000.0,8000.0 *
2039040796ID,failed,10200.0,10200.0 !
987394416ID,successful,25000.0,25000.0 *
264051908ID,successful,3000.0,3000.0 *
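
The task asks for `.apply()`, which is what the cell above uses. For comparison only, the same markers can be produced without a row-wise function by looking up the suffix with `Series.map` and a dictionary. This is a sketch of that alternative on hypothetical data; `demo` and `suffix_by_state` are illustrative names.

```python
import pandas as pd

# Hypothetical data; only the column names mirror the notebook.
demo = pd.DataFrame(
    {
        "state": ["successful", "failed", "canceled", "suspended", "live"],
        "usd_funding_goal": [50.0, 50000.0, 1200.0, 300.0, 8000.0],
    }
)

# Marker for each state; states not listed here get no marker.
suffix_by_state = {
    "successful": " *",
    "failed": " !",
    "canceled": " !",
    "suspended": " ---",
}

# map() returns NaN for unmapped states, so replace those with an empty string.
suffix = demo["state"].map(suffix_by_state).fillna("")

# Convert the numeric goal to text and append the marker column-wise.
demo["modified"] = demo["usd_funding_goal"].astype(str) + suffix
print(demo)
```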
