# Introduction to Pandas

Pandas is a powerful data manipulation library for Python, widely used for data analysis. It provides data structures such as `DataFrame` and `Series` that make it easy to work with structured data.

# Opening a CSV File

We typically start by loading data from a file. One of the most common formats is CSV (Comma-Separated Values). Pandas provides a `read_csv()` function to load data from a CSV file into a `DataFrame`.

In [1]:
import pandas as pd

# Example: Loading data from a CSV file
df = pd.read_csv('Datasets/gapminder.csv')
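
If the CSV file is not at hand, `read_csv()` can be tried on in-memory text instead. The following is a minimal sketch, assuming a three-row sample copied from the gapminder output shown later in this notebook; the names `csv_text`, `sample_df`, and `small_df` are illustrative only.

```python
import io

import pandas as pd

# A few rows copied from the gapminder output in this notebook, held in
# memory so the example runs without the CSV file on disk.
csv_text = """country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.445314
Afghanistan,Asia,1957,30.332,9240934,820.85303
Zimbabwe,Africa,2007,43.487,12311143,469.709298
"""

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

# usecols restricts loading to the named columns.
small_df = pd.read_csv(io.StringIO(csv_text), usecols=["country", "year", "pop"])

print(sample_df)
print(small_df)
```

Loading only the columns you need with `usecols` can save memory on wide files.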

## DataFrame Shape

The `shape` attribute of a DataFrame provides the dimensions of the DataFrame as a tuple (number of rows, number of columns). 

### Why doesn't `shape` have `()`?

`shape` is an attribute, not a method. Accessing it simply returns the dimensions of the DataFrame, so you don't call it as a function with parentheses.

In [2]:
# Example: Checking the shape of the DataFrame
df_shape = df.shape
print("Shape of the DataFrame:", df_shape)

Shape of the DataFrame: (1704, 6)
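
Because `shape` is a plain tuple, it can be unpacked into separate variables, and calling it like a function fails. A small sketch, assuming a three-row illustrative DataFrame named `sample_df` rather than the full dataset:

```python
import pandas as pd

# Illustrative sample, not the full gapminder dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "year": [1952, 1957, 2007],
})

# shape is a (rows, columns) tuple, so it unpacks directly.
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)

# Calling shape with parentheses tries to call the tuple and raises TypeError.
try:
    sample_df.shape()
except TypeError as err:
    print("TypeError:", err)
```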


## DataFrame Columns

The `columns` attribute returns an Index object containing the column names of the DataFrame. This is useful for identifying the features or variables available in the dataset.

In [4]:
# Example: Viewing the column names of the DataFrame
df_columns = df.columns
print("Columns in the DataFrame:", df_columns)

Columns in the DataFrame: Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
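
The `Index` returned by `columns` can be converted to a list, tested for membership, or replaced via `rename()`. The sketch below assumes a small illustrative DataFrame; the new column name `life_expectancy` is hypothetical and not part of the dataset.

```python
import pandas as pd

# Illustrative sample, not the full gapminder dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "lifeExp": [28.801, 43.487],
    "pop": [8425333, 12311143],
})

# Convert the Index to a plain Python list.
column_names = sample_df.columns.tolist()
print(column_names)

# Check whether a column exists before using it.
has_pop = "pop" in sample_df.columns
print(has_pop)

# rename returns a new DataFrame; the original is left unchanged.
renamed = sample_df.rename(columns={"lifeExp": "life_expectancy"})
print(renamed.columns.tolist())
```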


## DataFrame Info

The `info()` method provides a concise summary of the DataFrame, including the number of non-null entries, data types of each column, and memory usage. This is particularly helpful for getting a quick overview of the dataset.

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   country    1704 non-null   object 
 1   continent  1704 non-null   object 
 2   year       1704 non-null   int64  
 3   lifeExp    1704 non-null   float64
 4   pop        1704 non-null   int64  
 5   gdpPercap  1704 non-null   float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
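
`info()` prints its summary rather than returning it. When the same facts are needed as values, `dtypes` and `isna().sum()` are one option. A minimal sketch on an illustrative two-row sample:

```python
import pandas as pd

# Illustrative sample, not the full gapminder dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Zimbabwe"],
    "year": [1952, 2007],
    "lifeExp": [28.801, 43.487],
})

# Data type of each column, as a Series indexed by column name.
column_types = sample_df.dtypes
print(column_types)

# Number of missing values in each column.
missing_counts = sample_df.isna().sum()
print(missing_counts)
```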


## Descriptive Statistics

The `describe()` method generates descriptive statistics for the numeric columns of the DataFrame: count, mean, standard deviation, minimum, quartiles, and maximum. This is useful for understanding the distribution of the data.

In [6]:
# Example: Generating descriptive statistics
df_description = df.describe()
df_description

           year    lifeExp           pop    gdpPercap
count    1704.0     1704.0        1704.0       1704.0
mean     1979.5  59.474439    29601210.0  7215.327081
std    17.26533  12.917107   106157900.0  9857.454543
min      1952.0     23.599       60011.0   241.165876
25%     1965.75     48.198     2793664.0  1202.060309
50%      1979.5    60.7125     7023596.0  3531.846988
75%     1993.25    70.8455    19585220.0  9325.462346
max      2007.0     82.603  1318683000.0  113523.1329
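
By default `describe()` skips text columns such as `country`. Passing `include="all"` adds them, and calling `describe()` on a single column returns a `Series`. A sketch on an illustrative three-row sample:

```python
import pandas as pd

# Illustrative sample, not the full gapminder dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "continent": ["Asia", "Asia", "Africa"],
    "lifeExp": [28.801, 30.332, 43.487],
})

# include="all" summarises text columns too (unique, top, freq).
full_summary = sample_df.describe(include="all")
print(full_summary)

# Describing one column returns a Series of statistics.
life_summary = sample_df["lifeExp"].describe()
print(life_summary)
```

For text columns, `top` is the most frequent value and `freq` is how often it occurs.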


## Head and Tail

- `head()`: Displays the first 5 rows of the DataFrame by default, which is useful for a quick glance at the data.
- `tail()`: Displays the last 5 rows of the DataFrame by default, useful for checking data at the end of the DataFrame.

In [7]:
# Example: Displaying the first few rows
df.head()

       country  continent  year  lifeExp       pop   gdpPercap
0  Afghanistan       Asia  1952   28.801   8425333  779.445314
1  Afghanistan       Asia  1957   30.332   9240934   820.85303
2  Afghanistan       Asia  1962   31.997  10267083   853.10071
3  Afghanistan       Asia  1967    34.02  11537966  836.197138
4  Afghanistan       Asia  1972   36.088  13079460  739.981106


In [8]:
# Example: Displaying the last few rows
df.tail()

       country  continent  year  lifeExp       pop   gdpPercap
1699  Zimbabwe     Africa  1987   62.351   9216418  706.157306
1700  Zimbabwe     Africa  1992   60.377  10704340  693.420786
1701  Zimbabwe     Africa  1997   46.809  11404948   792.44996
1702  Zimbabwe     Africa  2002   39.989  11926563  672.038623
1703  Zimbabwe     Africa  2007   43.487  12311143  469.709298
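
Both `head()` and `tail()` accept the number of rows to return, which overrides the default of 5. A short sketch on an illustrative single-column sample:

```python
import pandas as pd

# Illustrative sample: the first six survey years from the dataset.
sample_df = pd.DataFrame({"year": [1952, 1957, 1962, 1967, 1972, 1977]})

# First two rows.
first_two = sample_df.head(2)
print(first_two)

# Last three rows; the original index labels are kept.
last_three = sample_df.tail(3)
print(last_three)
```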


## Selecting Columns

You can select one or more columns from a DataFrame by passing the column names inside square brackets `[]`.

In [9]:
# Example: Selecting a single column
selected_column = df['year']
print(selected_column.head())

0    1952
1    1957
2    1962
3    1967
4    1972
Name: year, dtype: int64


In [10]:
# Example: Selecting multiple columns
selected_columns = df[['country', 'pop']]
print(selected_columns.head())

       country       pop
0  Afghanistan   8425333
1  Afghanistan   9240934
2  Afghanistan  10267083
3  Afghanistan  11537966
4  Afghanistan  13079460
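
The two outputs above differ in type: single brackets return a `Series`, while a list of names inside the brackets returns a `DataFrame`, even when the list holds only one name. A sketch on an illustrative sample:

```python
import pandas as pd

# Illustrative sample, not the full gapminder dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe"],
    "year": [1952, 1957, 2007],
    "pop": [8425333, 9240934, 12311143],
})

# Single brackets with one name: a one-dimensional Series.
year_series = sample_df["year"]
print(type(year_series).__name__)

# A list of names (note the double brackets): a two-dimensional DataFrame.
year_frame = sample_df[["year"]]
print(type(year_frame).__name__)
print(year_frame.shape)
```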


## Subsetting with `loc`

`loc` is a label-based indexer: it selects rows, columns, or both at once by their labels, using square brackets rather than a function call. Unlike ordinary Python slices, a label slice such as `0:3` includes both endpoints, so the example below returns rows 0 through 3.

In [4]:
# Example: Subsetting rows and columns using `loc`
subset_loc = df.loc[0:3, ['year','country', 'pop']]
subset_loc

   year      country       pop
0  1952  Afghanistan   8425333
1  1957  Afghanistan   9240934
2  1962  Afghanistan  10267083
3  1967  Afghanistan  11537966
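
`loc` also accepts a boolean condition in place of row labels, which is a common way to filter rows. The sketch below assumes an illustrative five-row sample taken from the Afghanistan rows shown earlier.

```python
import pandas as pd

# Illustrative sample: the first five Afghanistan rows from the dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "pop": [8425333, 9240934, 10267083, 11537966, 13079460],
})

# Label slice: both endpoints are included, so this returns 4 rows.
by_label = sample_df.loc[0:3, ["year", "pop"]]
print(by_label)

# Boolean mask: keep only rows where the condition is True.
from_1962 = sample_df.loc[sample_df["year"] >= 1962, ["country", "year"]]
print(from_1962)
```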


## Subsetting with `iloc`

`iloc` is the position-based counterpart of `loc`: it selects rows and columns by integer position, counting from 0. As with ordinary Python slices, the end position is excluded, so `0:5` returns the first five rows.

In [12]:
# Example: Subsetting rows and columns using `iloc`
subset_iloc = df.iloc[0:5, [0, 1]]
subset_iloc

       country  continent
0  Afghanistan       Asia
1  Afghanistan       Asia
2  Afghanistan       Asia
3  Afghanistan       Asia
4  Afghanistan       Asia
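
The difference between labels and positions is easiest to see when the index is not the default 0, 1, 2, and so on. The sketch below assumes a hypothetical index of 10, 20, 30, 40, chosen only for illustration.

```python
import pandas as pd

# Illustrative sample with a hypothetical non-default index.
sample_df = pd.DataFrame(
    {
        "year": [1952, 1957, 1962, 1967],
        "lifeExp": [28.801, 30.332, 31.997, 34.02],
    },
    index=[10, 20, 30, 40],
)

# iloc counts positions and excludes the stop: positions 0 and 1.
by_position = sample_df.iloc[0:2]
print(by_position)

# loc matches labels and includes the stop: labels 10, 20 and 30.
by_label = sample_df.loc[10:30]
print(by_label)

# Negative positions count from the end, as in Python lists.
last_row = sample_df.iloc[-1]
print(last_row)
```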


## Slicing Data

Slicing selects a contiguous range of rows or columns. With `iloc`, a slice `start:stop` runs from position `start` up to, but not including, `stop`.

In [13]:
# Example: Slicing rows 10 to 19 and columns 2 to 4 (the stop position is excluded)
sliced_data = df.iloc[10:20, 2:5]
sliced_data

    year  lifeExp       pop
10  2002   42.129  25268405
11  2007   43.828  31889923
12  1952    55.23   1282697
13  1957    59.28   1476505
14  1962    64.82   1728137
15  1967    66.22   1984060
16  1972    67.69   2263554
17  1977    68.93   2509048
18  1982    70.42   2780097
19  1987     72.0   3075321
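
Slices can also leave out the start or stop, take a step, or use negative positions. A brief sketch on an illustrative five-row sample:

```python
import pandas as pd

# Illustrative sample: the first five Afghanistan rows from the dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan"] * 5,
    "year": [1952, 1957, 1962, 1967, 1972],
    "lifeExp": [28.801, 30.332, 31.997, 34.02, 36.088],
    "pop": [8425333, 9240934, 10267083, 11537966, 13079460],
})

# Omitted start: first three rows, first two columns.
top_left = sample_df.iloc[:3, :2]
print(top_left)

# A step of 2 takes every other row.
every_other = sample_df.iloc[::2]
print(every_other)

# A negative start selects the last two columns.
last_two_cols = sample_df.iloc[:, -2:]
print(last_two_cols)
```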


## Grouping Data with `groupby`

The `groupby()` method is used to group data based on the values of one or more columns, often followed by an aggregation function to summarize the data.

In [14]:
# Example: Grouping data by a column and calculating the mean for each group
grouped_data = df.groupby('year')['lifeExp'].mean()
grouped_data

year
1952    49.057620
1957    51.507401
1962    53.609249
1967    55.678290
1972    57.647386
1977    59.570157
1982    61.533197
1987    63.212613
1992    64.160338
1997    65.014676
2002    65.694923
2007    67.007423
Name: lifeExp, dtype: float64
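
A single `groupby()` can produce several statistics at once with `agg()`. The sketch below assumes a four-row sample (the 2002 and 2007 rows for Afghanistan and Zimbabwe shown earlier), so its results differ from the full-dataset averages above.

```python
import pandas as pd

# Illustrative sample: 2002 and 2007 rows for two countries from the dataset.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "year": [2002, 2007, 2002, 2007],
    "lifeExp": [42.129, 43.828, 39.989, 43.487],
})

# One statistic per group, as in the cell above.
mean_by_year = sample_df.groupby("year")["lifeExp"].mean()
print(mean_by_year)

# Several statistics per group: agg returns one column per function.
summary = sample_df.groupby("continent")["lifeExp"].agg(["mean", "min", "max"])
print(summary)
```

The group keys become the index of the result, so individual values can be read with `loc`, for example `summary.loc["Asia", "max"]`.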

## Grouped Frequency

You can count how many rows fall into each group by combining `groupby()` with `size()` or `count()`. `size()` counts every row in the group, while `count()` counts only the non-null values in the selected column.

In [15]:
# Example: Counting the number of rows for each country
grouped_frequency = df.groupby('country')['continent'].size()
grouped_frequency

country
Afghanistan           12
Albania               12
Algeria               12
Angola                12
Argentina             12
                      ..
Vietnam               12
West Bank and Gaza    12
Yemen, Rep.           12
Zambia                12
Zimbabwe              12
Name: continent, Length: 142, dtype: int64
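
`size()` and `count()` give the same answer here because the dataset has no missing values. The sketch below assumes a small sample with one missing `continent` value, introduced deliberately for illustration, to show where they differ; it also uses `value_counts()` for a simple frequency table.

```python
import pandas as pd

# Illustrative sample; the None is a deliberately introduced missing value.
sample_df = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Zimbabwe", "Zimbabwe"],
    "continent": ["Asia", None, "Africa", "Africa"],
})

# size counts every row in each group, missing values included.
rows_per_country = sample_df.groupby("country")["continent"].size()
print(rows_per_country)

# count skips missing values in the selected column.
non_null_per_country = sample_df.groupby("country")["continent"].count()
print(non_null_per_country)

# value_counts tallies each distinct value, most frequent first.
continent_counts = sample_df["continent"].value_counts()
print(continent_counts)
```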

# Challenge

Now that you've learned how to explore and manipulate a DataFrame using Pandas, try to complete the following tasks with a new dataset:

1. Load the `devon_and_cornwall_police.csv` file into a DataFrame.
2. Explore its shape, columns, and general information using `shape`, `columns`, `info()`, and `describe()`.
3. Select specific columns and subset the data using `loc` and `iloc`.
4. Group the data by a relevant column and calculate the mean for each group.
5. Calculate the frequency of another column within these groups.

### Bonus Challenges

1. Identify the top 3 locations with the highest number of crimes reported. What are these locations, and how many crimes were reported in each?
2. Choose a location (e.g., "On or near Supermarket") and determine the most common type of crime reported at that location. How many times was this crime reported?
3. For each crime type, what is the most frequent outcome category (e.g., "Under investigation", "Investigation complete; no suspect identified")? Are there any crime types that consistently lead to a specific outcome?

In [28]:
#1

In [29]:
#2

In [30]:
#3

In [31]:
#4

In [32]:
#5

In [33]:
#Bonus 1

In [34]:
#Bonus 2

In [35]:
#Bonus 3