# Chapter 1: Series

In [2]:
import numpy as np
import pandas as pd

1.2 Exercise 1: Test scores

Create a series of 10 elements, random integers from 70-100, representing scores on a monthly exam. Set the index to be the month names, starting in September and ending in June. (If these months don’t match the school year in your location, then feel free to make them more realistic.)

In [3]:
# Create an index for the months of September to June.
# Simplest approach: split a string of abbreviated month names.
index = 'Sep Oct Nov Dec Jan Feb Mar Apr May Jun'.split()

# Alternative: build the index from a pandas date range.
# Starting at September 1, 2024, generate 10 month-end dates (freq='M')
# and format each one as its full month name.
# This assignment overwrites the list above, so the Series uses full month names.
# Note: pandas 2.2+ deprecates the 'M' alias in favor of 'ME' for month-end.
index = (
    pd.date_range(start='2024-09-01', freq='M', periods=10)
    .strftime('%B')
)

# Initialize a random number generator with a fixed seed for reproducibility
rng = np.random.default_rng(1984)

# Generate an array of random integers between 70 and 100 (inclusive)
# The size of the array is 10, corresponding to the 10 months in our index.
data = rng.integers(70, 101, 10)

# Create a pandas Series using the generated random data and the month index
# The Series will have the month names as the index and the random integers as the values.
s = pd.Series(data, index=index)

# Display the first few entries of the Series to verify the output
s.head()


September    77
October      76
November     93
December     70
January      71
dtype: int64

With this series, we answer the following questions:

1) What is the student’s average test score for the entire year?

In [4]:
# Calculate the average value of the entire Series
# This represents the mean score for the entire year (September to June).
avg_year = s.mean()
avg_year

81.2

2) What is the student’s average test score during the first half of the year (i.e., the first five months)?

In [5]:
# Calculate the average value for the first half of the Series
# The first half consists of the first 5 months (September to January).
avg_first_half = s.loc['September':'January'].mean()
avg_first_half

77.4

3) What is the student’s average test score during the second half of the year? 

In [6]:
# Calculate the average value for the second half of the Series
# The second half consists of the last 5 months (February to June).
avg_second_half = s.loc['February':'June'].mean()
avg_second_half

85.0

4) Did the student improve their performance in the second half? If so, then by how much?

In [7]:
# Calculate the difference between the average scores of the second half and the first half
# This value indicates how much the average score has changed from the first half to the second half of the year.
score_diff = avg_second_half - avg_first_half

# Report the direction of the change, with its magnitude shown to two decimal places
if score_diff > 0:
    print(f'Score improved by {score_diff:.2f}')
elif score_diff < 0:
    # abs() avoids printing a negative number after "dropped by"
    print(f'Score dropped by {abs(score_diff):.2f}')
else:
    print('Score did not change')

Score improved by 7.60


1.2.3 Beyond the exercise

1) In which month did this student get their highest score? 

In [8]:
# Find the position of the maximum value in the Series
# argmax() returns the integer position (not the label) of the first occurrence of the maximum.
idx = s.argmax()

# Look up the month name stored at that position in the index
max_month = s.index[idx]

# Print the month with the maximum score
print(f'The month with the highest score is: {max_month}')

The month with the highest score is: February
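
As a possible shortcut, `idxmax()` returns the index label of the maximum directly, which avoids the separate lookup in `s.index`. The sketch below uses a small, made-up series (the name `scores` and its values are assumptions for illustration only) to compare the two methods.

```python
import pandas as pd

# Hypothetical scores, used only to illustrate the two methods
scores = pd.Series(
    [77, 76, 93, 70],
    index=['September', 'October', 'November', 'December'],
)

position = scores.argmax()  # integer position of the maximum
label = scores.idxmax()     # index label of the maximum

print(position, label)
```

Both approaches identify the same row; `idxmax()` simply skips the intermediate position.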


2) What were this student’s five highest scores? Round the student’s scores to the nearest 10, so that a score of 82 would be rounded down to 80, but a score of 87 would be rounded up to 90.

In [9]:
# Retrieve the top 5 largest values from the Series
# The nlargest() function returns the specified number of largest values in descending order.
# By default, it returns the top 5 values if no argument is provided.
largest_values = s.nlargest()

# Round the largest values to the nearest ten
# The round(-1) method rounds the values to the nearest multiple of 10.
rounded_largest_values = largest_values.round(-1)

# Print the rounded largest values
print(rounded_largest_values)

February    90
November    90
March       90
June        80
May         80
dtype: int64


1.3 Exercise 2: Scaling test scores

When I was in high school and college, our instructors would sometimes give tests that were extremely hard. Rather than fail most of the class, they would "scale" the test scores, known in some places as "grading on a curve." That is: They would assume that the average test score should be 80, calculate the difference between our actual mean and 80, then add that difference to each of our scores.

For this exercise, I want you to generate 10 test scores between 40 and 60, again using an index starting at September and ending with June. Find the mean of the scores, and add the difference between the mean and 80 to each of the scores.

In [10]:
# Initialize a random number generator with a fixed seed for reproducibility
# The seed value (1984) ensures that the random numbers generated can be reproduced.
rng = np.random.default_rng(1984)

# Generate an array of 10 random integers between 40 and 60 (inclusive)
# This simulates some data for the Series.
data = rng.integers(40, 61, 10)

# Create a pandas Series using the generated random data and the previously defined index
# The index should correspond to the months (e.g., September to June).
s = pd.Series(data, index=index)

# Calculate the average score of the Series
avg_score = s.mean()

# Calculate the difference between a target score (80) and the average score
# This value indicates how much the average score falls short of the target.
diff = 80 - avg_score

# Add the difference to each element in the Series
# This operation adjusts each score in the Series by the calculated difference.
adjusted_scores = s + diff

# Print the adjusted scores
print(adjusted_scores)

September    77.5
October      76.5
November     88.5
December     72.5
January      73.5
February     88.5
March        87.5
April        73.5
May          79.5
June         82.5
dtype: float64
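
A quick sanity check on the scaling: adding a constant to every score should move the mean to exactly 80 while leaving the spread untouched. The sketch below rebuilds the raw scores from the output above (each adjusted score minus the shift of 32.5); the names `raw`, `shift`, and `adjusted` are illustrative assumptions.

```python
import pandas as pd

# Raw scores behind the adjusted output above (adjusted score minus 32.5)
raw = pd.Series([45, 44, 56, 40, 41, 56, 55, 41, 47, 50])

target = 80
shift = target - raw.mean()   # distance from the actual mean to the target
adjusted = raw + shift        # the same constant is added to every score

print(shift)
print(adjusted.mean())
print(raw.std(), adjusted.std())
```

Because every score moves by the same amount, the standard deviation is unchanged; only the center of the distribution shifts.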


1.3.3 Beyond the exercise

We can say anyone who scored within 1 standard deviation of the mean got a C (below the mean) or a B (above the mean). Anyone who scored more than 1 standard deviation above the mean got an A, and anyone who got more than one standard deviation below the mean got a D. 

First we compute the grades

In [11]:
# Compute the z-scores for the Series
# Z-scores indicate how many standard deviations an element is from the mean.
# The formula for z-score is (X - mean) / std, where std is the standard deviation.
# The parameter ddof=0 specifies that we are calculating the population standard deviation.
# This is appropriate when we consider the entire dataset as the population.
z_scores = (s - s.mean()) / s.std(ddof=0)

# Categorize the z-scores into letter grades using pd.cut
# By default each bin includes its right edge, so the ranges are:
# - D: z <= -1
# - C: -1 < z <= 0
# - B: 0 < z <= 1
# - A: z > 1
grades = pd.cut(z_scores, bins=[-np.inf, -1, 0, 1, np.inf], labels='D C B A'.split())

# Create a DataFrame to combine scores, z-scores, and grades
# This DataFrame will have three columns: 'score', 'z-score', and 'grade'.
student_scores = pd.DataFrame({'score': s, 'z-score': z_scores, 'grade': grades})

# Display the DataFrame containing scores, z-scores, and corresponding grades
student_scores


           score    z-score  grade
September     45  -0.412955      C
October       44  -0.578137      C
November      56   1.404048      A
December      40  -1.238866      D
January       41  -1.073684      D
February      56   1.404048      A
March         55   1.238866      A
April         41  -1.073684      D
May           47  -0.082591      C
June          50   0.412955      B
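
It may help to see how `pd.cut` treats values that fall exactly on a bin edge. With the default `right=True`, each bin includes its right edge, so a z-score of exactly 0 lands in the lower bin. The z-scores below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical z-scores chosen to sit exactly on the bin edges
z = pd.Series([-1.0, 0.0, 1.0, 1.5])

grades = pd.cut(
    z,
    bins=[-np.inf, -1, 0, 1, np.inf],
    labels=['D', 'C', 'B', 'A'],
)

print(grades.tolist())
```

Passing `right=False` would flip this behavior, so that each bin includes its left edge instead.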


1) During which months did our student get an A, B, C, and D?

In [12]:
def get_months(grade):
    # Retrieve the months corresponding to the specified grade from the DataFrame
    # A boolean mask inside loc keeps only the rows whose grade matches the input grade.
    months = student_scores.loc[student_scores.grade == grade].index.values
    
    # Return a formatted string listing the months in which the student received that grade
    return f"The student got grade {grade} in {', '.join(months)}"

# Loop through each grade and print the months in which the student received that grade
for grade in 'A B C D'.split():
    print(get_months(grade))

The student got grade A in November, February, March
The student got grade B in June
The student got grade C in September, October, May
The student got grade D in December, January, April


2) Were there any test scores more than 2 standard deviations above or below the mean? If so, in which months? 

In [13]:
# Filter the DataFrame to find scores with z-scores greater than 2 or less than -2
# This identifies outliers, which are defined as scores that are more than two standard deviations away from the mean.
outliers = student_scores.loc[(student_scores['z-score'] > 2) | (student_scores['z-score'] < -2)]

# Display the outliers
outliers

Empty DataFrame
Columns: [score, z-score, grade]
Index: []


3) How close are the mean and median to one another? 

In [14]:
# Calculate the mean (average) score from the student_scores DataFrame
mean = student_scores.score.mean()

# Calculate the median score from the student_scores DataFrame
median = student_scores.score.median()

# Calculate the difference between the mean and the median
# This value indicates the skewness of the score distribution:
# - A positive difference suggests a right (positive) skew.
# - A negative difference suggests a left (negative) skew.
diff = mean - median

# Display the difference
diff

1.5

4) What does it mean if they are close? What would it mean if they are far apart?

If the mean and median are close together, the distribution of the data is roughly symmetric; the further apart they are, the more skewed the distribution. A large gap may also point to outliers, which affect the mean more than the median.
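
To make this concrete, here is a small sketch using the ten raw scores from this exercise plus one made-up extreme value (200, an assumption for illustration). A single outlier moves the mean a long way but barely changes the median.

```python
import pandas as pd

# The ten raw scores from this exercise, sorted
scores = pd.Series([40, 41, 41, 44, 45, 47, 50, 55, 56, 56])

# Append one hypothetical extreme value
with_outlier = pd.concat([scores, pd.Series([200])], ignore_index=True)

print(scores.mean(), scores.median())
print(with_outlier.mean(), with_outlier.median())
```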

1.4 Exercise 3: Counting 10s digits

In this exercise, I want you to generate 10 random integers in the range 0 - 100. (Remember that the np.random.randint function returns numbers that include the lower bound, but exclude the upper bound.) Create a series containing those numbers' 10s digits. Thus, if our series contains 10, 25, 32, we want the series 1, 2, 3.

In [15]:
# Set the random seed for reproducibility
# This ensures that the random numbers generated can be reproduced in future runs.
np.random.seed(1984)

# Generate an array of 10 random integers between 0 and 100 (inclusive)
# randint excludes the upper bound, so we pass 101.
data = np.random.randint(0, 101, 10)

# Extract the 10s digit of each value:
# integer division by 10 drops the 1s digit, and % 10 keeps only the last remaining digit
# (so 100 gives 0 rather than 10).
s = pd.Series(data // 10 % 10)

# Display the Series
s


0    9
1    2
2    7
3    3
4    0
5    1
6    4
7    2
8    0
9    0
dtype: int32

1.4.3 Beyond the exercise
1) What if the range were from 0 - 10,000? How would that change your strategy, if at all?

In [16]:
# One way to generalize: pick a divisor 10 times smaller than the maximum value.
# Note that integer division by this divisor extracts the leading (1,000s) digit;
# to keep the 10s digit for any range, use data // 10 % 10 instead.
n_max = 10_000
divisor = n_max // 10

# Set the random seed for reproducibility
np.random.seed(1984)

# Generate an array of 10 random integers between 0 and n_max (inclusive)
data = np.random.randint(0, n_max + 1, 10)

# Create a pandas Series of each value integer-divided by the divisor (1,000)
s = pd.Series(data // divisor)

# Display the Series
s

0    4
1    5
2    1
3    5
4    7
5    3
6    8
7    4
8    2
9    9
dtype: int32
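
If the goal is still the 10s digit, one range-independent approach is to drop the 1s digit with integer division and then keep only the last remaining digit with the modulo operator. The values below are made up to cover two-, three-, four-, and five-digit numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical values spanning several magnitudes
data = np.array([10, 25, 32, 100, 4567, 10000])

# // 10 removes the 1s digit; % 10 keeps only the digit now in the 1s place
tens = pd.Series(data // 10 % 10)

print(tens.tolist())
```

The same expression works unchanged for any upper bound, so the strategy would not need to change as the range grows.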

2) Given a range from 0 to 10,000, what’s the smallest dtype we should use for our integers? 

In [17]:
# Calculate the number of bits needed to represent every value from 0 to 10,000
# There are 10,001 distinct values, so we take the ceiling of log2(10001).
bits_required = int(np.ceil(np.log2(10_000 + 1)))  # 14 bits

# A signed integer spends one more bit on the sign, so we need 15 bits.
# int8 tops out at 127, which is too small.
# int16 can represent values from -32,768 to 32,767, so it is the smallest signed dtype that fits.
# (If the values are never negative, uint16 also fits.)
data_type = np.int16
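
The same conclusion can be checked against NumPy itself rather than worked out by hand. This sketch uses `np.iinfo` to read each integer type's limits and `np.min_scalar_type` to ask for the smallest type that holds a given value; the variable names are illustrative assumptions.

```python
import numpy as np

max_value = 10_000

# Walk the signed integer types from smallest to largest and stop at the first that fits
signed_types = (np.int8, np.int16, np.int32, np.int64)
smallest_signed = next(t for t in signed_types if np.iinfo(t).max >= max_value)

# NumPy's own answer for this single value (it prefers unsigned types for non-negative values)
smallest_any = np.min_scalar_type(max_value)

print(smallest_signed.__name__, np.iinfo(smallest_signed).max)
print(smallest_any)
```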


3) Create a new series, with 10 floating-point values between 0 and 1,000. Find the numbers whose integer component (i.e., ignoring any fractional part) are even.

In [18]:
# Set the random seed for reproducibility
# This ensures that the random numbers generated can be reproduced in future runs.
np.random.seed(0)

# Generate a pandas Series of 10 random floats drawn uniformly from the interval [0, 1001)
# (np.random.uniform excludes the upper bound; passing 1000 would match the stated range of 0 to 1,000)
s = pd.Series(np.random.uniform(0, 1001, 10))

# Filter the Series to select only the values that correspond to even integers
# The astype(int) method converts the floating-point numbers to integers,
# and the modulo operator % is used to check if the integer is even (i.e., remainder is 0).
even_values = s.loc[s.astype(int) % 2 == 0]

# Display the filtered Series containing only even values
even_values


4    424.078454
5    646.540007
6    438.024798
7    892.664774
8    964.626423
dtype: float64

1.5 Exercise 4: Descriptive statistics

1) Generate a series of 100,000 floats in a normal distribution, with a mean at 0 and a standard deviation of 100.
2) Get the descriptive statistics for this series. How close are the mean and median? (You don’t need to calculate the difference, but rather consider why they aren’t the same.) 
3) Replace the minimum value with 5 times the maximum value. Get the descriptive statistics again.
4) Did the mean, median, and standard deviations change from their previous values? (Again, it’s enough to see the difference without calculating it.) If so, why?

In [19]:
# Set the random seed for reproducibility
np.random.seed(1984)

# Generate a sample of 100,000 data points from a normal distribution
# with a mean of 0 and a standard deviation of 100
data = np.random.normal(loc=0, scale=100, size=100_000)

# Create a Pandas Series from the generated data
s = pd.Series(data)

# Calculate the mean of the data
mean = s.mean()

# Calculate the median of the data
median = s.median()

# Display descriptive statistics of the Series
description = s.describe()

# Print the results
print("Mean:", mean)
print("Median:", median)
print("Descriptive Statistics:\n", description)

# Note: The mean and median differ slightly because this is a finite random sample.
# A normal distribution has identical mean and median, but any particular sample
# drawn from it only approximates that symmetry.


Mean: -0.0510804027746863
Median: -0.08262623080339143
Descriptive Statistics:
 count    100000.000000
mean         -0.051080
std         100.232729
min        -430.169171
25%         -67.653133
50%          -0.082626
75%          67.118507
max         396.458519
dtype: float64


In [20]:
# Calculate the mean and median of the original data
mean_original = s.mean()
median_original = s.median()

# Display descriptive statistics of the original Series
description_original = s.describe()

# Find the indices of the minimum and maximum values in the Series
min_idx = s.argmin()
max_idx = s.argmax()

# Modify the minimum value by setting it to five times the maximum value
s.iloc[min_idx] = 5 * s.iloc[max_idx]

# Check the descriptive statistics again after modification
description_modified = s.describe()

# Print the results before and after modification
print("Original Mean:", mean_original)
print("Original Median:", median_original)
print("Original Descriptive Statistics:\n", description_original)
print("\nModified Descriptive Statistics:\n", description_modified)

# Note: After replacing the minimum value, the median barely moves, while the mean
# and standard deviation shift more. The mean and standard deviation depend on the
# magnitude of every value, so one extreme value pulls both upward; the median
# depends only on the ordering of the values.


Original Mean: -0.0510804027746863
Original Median: -0.08262623080339143
Original Descriptive Statistics:
 count    100000.000000
mean         -0.051080
std         100.232729
min        -430.169171
25%         -67.653133
50%          -0.082626
75%          67.118507
max         396.458519
dtype: float64

Modified Descriptive Statistics:
 count    100000.000000
mean         -0.026956
std         100.419353
min        -429.211074
25%         -67.651713
50%          -0.081459
75%          67.120860
max        1982.292596
dtype: float64


1.5.3 Beyond the exercise

1) Demonstrate that 68%, 95%, and 99.7% of the values in s are indeed within 1, 2, and 3 standard deviations of the mean.


In [21]:
# Set the random seed for reproducibility
np.random.seed(1984)

# Function to check the proportion of values within a specified number of standard deviations from the mean
def checkLimits(ser, multiplier=1):
    """
    Calculate the proportion of values within a specified number of standard deviations from the mean.

    Parameters:
    ser (pd.Series): The input Pandas Series to analyze.
    multiplier (int or float): The number of standard deviations to consider.

    Returns:
    float: The proportion of values within the specified limits.
    """
    mean = ser.mean()  # Calculate the mean of the series
    std = ser.std()    # Calculate the standard deviation of the series
    
    # Count the number of values within the specified range
    count = ((ser >= mean - multiplier * std) & (ser <= mean + multiplier * std)).sum()
    
    # Return the proportion of values within the specified limits
    return count / len(ser)

# Check if values within 1, 2, and 3 standard deviations constitute the expected proportions
for i in range(1, 4):   
    proportion = checkLimits(s, i) * 100  # Calculate the percentage
    print(f'The values within {i} standard deviation(s) constitute {proportion:.2f}% of the values')


The values within 1 standard deviation(s) constitute 68.42% of the values
The values within 2 standard deviation(s) constitute 95.41% of the values
The values within 3 standard deviation(s) constitute 99.73% of the values
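
For comparison, the theoretical proportions for a normal distribution can be computed from the error function: the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). The sketch below uses only the standard library; the name `expected` is an illustrative assumption.

```python
import math

# P(|Z| <= k) for a standard normal variable Z equals erf(k / sqrt(2))
expected = {k: math.erf(k / math.sqrt(2)) for k in (1, 2, 3)}

for k, p in expected.items():
    print(f'Within {k} standard deviation(s): {p:.2%}')
```

The empirical percentages above sit close to these theoretical values, as expected for a sample of this size.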


2) Calculate the mean of numbers greater than s.mean(). Then calculate the mean of numbers less than s.mean(). Is the average of these two numbers the same as s.mean()?

In [22]:
# Set the random seed for reproducibility
np.random.seed(1984)

# Calculate the mean of the entire series
overall_mean = s.mean()

# Calculate the mean of values below the overall mean
lower_mean = s.loc[s < overall_mean].mean()

# Calculate the mean of values above the overall mean
upper_mean = s.loc[s > overall_mean].mean()

# Calculate the average of the lower and upper means
average_of_means = (lower_mean + upper_mean) / 2

# Print the average of the means and the overall mean
print(f'Average of lower and upper means: {average_of_means}')
print(f'Overall mean: {overall_mean}')

# Note: The two numbers differ, and not because of floating-point error.
# The average of the two group means equals the overall mean only when both groups
# contain the same number of values; in general each group mean must be weighted by its size.
# The extreme value introduced earlier also pulls the upper mean upward.


Average of lower and upper means: 0.008196652647868063
Overall mean: -0.02695578510417541


3) What is the mean of the numbers beyond 3 standard deviations?

In [23]:
# Set the random seed for reproducibility
np.random.seed(1984)

# Calculate the mean and standard deviation of the series
mean = s.mean()
std = s.std()

# Calculate the mean of values more than three standard deviations from the mean,
# i.e. above mean + 3*std or below mean - 3*std
outlier_mean = s.loc[(s > mean + 3 * std) | (s < mean - 3 * std)].mean()

# Print the calculated mean of outliers
print(f'Mean of values outside three standard deviations: {outlier_mean}')

Mean of values outside three standard deviations: -23.52583375799195


1.6 Exercise 5: Monday temperatures

In this exercise, I want you to create a series of 28 temperature readings in Celsius, representing four seven-day weeks, randomly selected from a normal distribution with a mean of 20 and a standard deviation of 5, rounded to the nearest integer. (If you’re in a country that measures temperature in Fahrenheit, then just pretend you’re looking at the weather in an exotic foreign location, rather than where you live.) The index should start with Sun, continue through Sat, and then repeat Sun through Sat until each temperature has a value.

The question is: What was the mean temperature on Mondays during this period?

In [24]:
# Set the random seed for reproducibility
np.random.seed(1984)

# Create a list of days of the week, repeated for 4 weeks
days_of_week = 'Sun Mon Tue Wed Thu Fri Sat'.split()
index = days_of_week * 4

# Generate a Series of normally distributed random numbers
# Mean = 20, Standard Deviation = 5, Number of samples = 28
random_values = np.random.normal(loc=20, scale=5, size=28).round()

# Create a pandas Series with the generated random values and specified index
# using int8 to cast the rounded float values to integer 
ser = pd.Series(random_values, index=index, dtype='int8')

# Display the first few entries of the Series
print("First few entries of the Series:")
print(ser.head())

# Calculate the mean of the values corresponding to 'Mon'
mean_monday_value = ser.loc['Mon'].mean()

# Display the mean value for 'Mon'
print(f"Mean value for 'Mon': {mean_monday_value}")

First few entries of the Series:
Sun    20
Mon    17
Tue    24
Wed    17
Thu    25
dtype: int8
Mean value for 'Mon': 20.75


1.6.3 Beyond the exercise
1) What was the average temperature on weekends (i.e., Saturdays and Sundays)? 

In [25]:
# Calculate the average temperature for the weekend (Saturday and Sunday)
avg_temp_weekend = ser.loc[['Sat', 'Sun']].mean()

# Display the average temperature for the weekend
print(f"Average temperature for the weekend (Sat & Sun): {avg_temp_weekend}")

Average temperature for the weekend (Sat & Sun): 21.25


2) How many times will the change in temperature from the previous day be greater than 2 degrees? 

In [26]:
# Calculate the difference in temperature between consecutive days
# The first entry has no previous day, so the subtraction yields NaN there; .dropna() removes it.
diff = (ser - ser.shift(1)).dropna()

# Count the days on which the temperature rose by more than 2 degrees
# (this counts increases only; use diff.abs() > 2 to count changes in either direction)
count_exceeding_diff = (diff > 2).sum()

# Display the count of instances with a difference greater than 2 degrees
print(f"Number of instances where the temperature difference exceeds 2 degrees: {count_exceeding_diff}")


Number of instances where the temperature difference exceeds 2 degrees: 11
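
Whether "change" means a rise only or a move in either direction affects the count. The sketch below, on five made-up readings, compares the two interpretations using `Series.diff()`, which computes the same day-to-day difference as subtracting a shifted copy.

```python
import pandas as pd

# Hypothetical temperature readings
temps = pd.Series([20, 17, 24, 17, 25])

# Day-to-day change; the first entry has no previous day and is dropped
change = temps.diff().dropna()

rises_only = int((change > 2).sum())              # increases of more than 2 degrees
either_direction = int((change.abs() > 2).sum())  # rises or drops of more than 2 degrees

print(rises_only, either_direction)
```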



3) What are the two most common temperatures in our data set, and how often does each appear?

In [27]:
# Count the occurrences of each temperature value in the Series
value_counts = ser.value_counts()

# Get the two most common temperature values
top_two_values = value_counts.nlargest(2).index.to_numpy()

# Display the two most common temperature values
print(f"The two most common temperature values are: {top_two_values}")

The two most common temperature values are: [17 24]
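
To also answer how often each temperature appears, one option is to keep the counts rather than extracting only the index. Since `value_counts()` sorts by frequency in descending order, `head(2)` yields both the values and their counts. The readings below are made up for illustration.

```python
import pandas as pd

# Hypothetical temperature readings
temps = pd.Series([17, 24, 17, 20, 24, 17, 21])

# The index holds the temperatures; the values hold how often each appears
top_two = temps.value_counts().head(2)

print(top_two)
```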


1.7 Exercise 6: Passenger frequency

The data we’ll look at is in the file taxi-passenger-count.csv, available along with the other data files used in this course. The data comes from 2015 data I retrieved from New York City’s open data site, from which you can get enormous amounts of information about taxi rides in New York City over the last few years. This file shows the number of passengers in each of roughly 10,000 rides.

Your task in this exercise is to show what percentage of taxi rides had only 1 passenger, vs. the (theoretical) maximum of 6 passengers.

In [28]:
# Read the CSV file containing taxi passenger counts
# The header is set to None since the file does not contain a header row
ser = pd.read_csv('data/taxi-passenger-count.csv', header=None).squeeze()

# Display the first few entries of the Series
print("First few entries of the Series:")
print(ser.head())

# Calculate the share of rides with exactly one passenger
# The mean of a boolean Series is the fraction of True values
perc_one_passenger = (ser == 1).mean()

# Calculate the share of rides with exactly six passengers
perc_six_passenger = (ser == 6).mean()

# Display the shares formatted as percentages with one decimal place
print(f"Percentage of rides with one passenger: {perc_one_passenger:.1%}")
print(f"Percentage of rides with six passengers: {perc_six_passenger:.1%}")


First few entries of the Series:
0    1
1    1
2    1
3    1
4    1
Name: 0, dtype: int64
Percentage of rides with one passenger: 72.1%
Percentage of rides with six passengers: 3.7%


1.7.3 Beyond the exercise

Let’s analyze our taxi passenger data in a few more ways:

1) What are the 25%, 50% (median), and 75% quantiles for this data set?

In [29]:
# Calculate the 25th, 50th, and 75th percentiles (quartiles)
quartiles = ser.quantile([0.25, 0.5, 0.75])

# Display the calculated quartiles
print("Quartiles:")
print(quartiles)


Quartiles:
0.25    1.0
0.50    1.0
0.75    2.0
Name: 0, dtype: float64


2) Can you guess the results before you execute the code? 

In [30]:
# The main exercise showed that about 72% of rides carry exactly 1 passenger,
# so the 25% and 50% quantiles must both be 1. The 75% quantile falls just past
# that 72% mark, so it must be larger than 1.
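
The reasoning can be checked on a toy data set. The sketch below builds 100 made-up rides in which 72% carry one passenger (the split of the remaining rides is an assumption) and confirms that the lower two quartiles land on 1.

```python
import pandas as pd

# Hypothetical rides: 72 with one passenger, 14 with two, 14 with five
rides = pd.Series([1] * 72 + [2] * 14 + [5] * 14)

quartiles = rides.quantile([0.25, 0.5, 0.75])

print(quartiles)
```

Any quantile at or below the share of one-passenger rides must equal 1, whatever the rest of the data looks like.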

3) What proportion of taxi rides are for 3, 4, 5, or 6 passengers? 

In [31]:
# Calculate the proportion of rides for each passenger count
normalized_counts = ser.value_counts(normalize=True)

# Select the proportions for 3, 4, 5, and 6 passengers
# Their sum, selected_counts.sum(), is the combined proportion (about 0.148 here)
selected_counts = normalized_counts.loc[[3, 4, 5, 6]]

# Display the selected proportions
print(selected_counts)


0
3    0.040604
4    0.018202
5    0.052005
6    0.036904
Name: proportion, dtype: float64


4) Consider that you’re in charge of vehicle licensing for New York taxis. Given these numbers, would more people benefit from smaller taxis that can take only one or two passengers, or larger taxis that can take five or six passengers?

In [32]:
# Smaller taxis would benefit more people, because the large majority of rides
# carry only one or two passengers.

1.8 Exercise 7: Long, medium, and short taxi rides

Show the number of rides in each of three categories:
short, <= 2 miles
medium, > 2 miles, but <= 10 miles
long, > 10 miles

In [33]:
# Load the dataset from a CSV file into a pandas Series
ser = pd.read_csv('data/taxi-distance.csv', header=None).squeeze()

# Display the first few entries of the Series to understand the data
print(ser.head())

# Define bins for categorizing the distances
bins = [-np.inf, 2, 10, np.inf]  # Define the bin edges
labels = ['short', 'medium', 'long']  # Define the corresponding labels for the bins

# Categorize the distances into 'short', 'medium', and 'long'
categories = pd.cut(ser, bins=bins, labels=labels)

# Count the occurrences of each category
category_counts = categories.value_counts()

# Display the counts of each category
print(category_counts)


0    1.63
1    0.46
2    0.87
3    2.13
4    1.40
Name: 0, dtype: float64
0
short     5890
medium    3402
long       707
Name: count, dtype: int64


1.8.3 Beyond the exercise

1) Compare the mean and median trip distances. What does that tell you about the distribution of our data?

In [34]:
# Calculate descriptive statistics for the Series
description = ser.describe()

# Extract the mean and median (50th percentile) from the descriptive statistics
mean_value = description['mean']
median_value = description['50%']

# Print the mean and median values
print(f"Mean: {mean_value}, Median: {median_value}")

# Interpretation of the results
if mean_value > median_value:
    print("The mean is greater than the median, indicating the presence of outliers.")
    print("This suggests a right-skewed distribution.")
elif mean_value < median_value:
    print("The mean is less than the median, suggesting a left-skewed distribution.")
else:
    print("The mean equals the median, suggesting a roughly symmetric distribution.")


Mean: 3.1585108510851083, Median: 1.7
The mean is greater than the median, indicating the presence of outliers.
This suggests a right-skewed distribution.


2) How many short, medium, and long trips were there for trips that had only one passenger? Note that data for passenger count and trip length are from the same data set, meaning that the indexes are the same. 

In [35]:
# Load the datasets from CSV files into pandas Series
s1 = pd.read_csv('data/taxi-passenger-count.csv', header=None).squeeze()  # Passenger count
s2 = pd.read_csv('data/taxi-distance.csv', header=None).squeeze()  # Trip distance

# Create a DataFrame combining passenger counts, distances, and trip categories
df = pd.DataFrame({'passengers': s1, 'distance': s2, 'trip': categories})

# Display the first few entries of the DataFrame
print(df.head())

# Count the occurrences of trip categories for trips with exactly 1 passenger
trip_counts_for_one_passenger = df[df.passengers == 1].trip.value_counts()

# Display the counts of trip categories for trips with 1 passenger
print(trip_counts_for_one_passenger)


   passengers  distance    trip
0           1      1.63   short
1           1      0.46   short
2           1      0.87   short
3           1      2.13  medium
4           1      1.40   short
trip
short     4333
medium    2387
long       487
Name: count, dtype: int64
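
If counts for every passenger number are of interest, not just single-passenger rides, a cross-tabulation is one possible alternative to filtering. The sketch below uses `pd.crosstab` on a small made-up DataFrame; the column names mirror those above, but the data is hypothetical.

```python
import pandas as pd

# Hypothetical rides, for illustration only
rides = pd.DataFrame({
    'passengers': [1, 1, 2, 1, 3],
    'trip': ['short', 'medium', 'short', 'short', 'long'],
})

# Rows: passenger count; columns: trip category; cells: number of rides
table = pd.crosstab(rides['passengers'], rides['trip'])

print(table)
```

The row for 1 passenger holds the same kind of counts that the filter-then-`value_counts` approach produces.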


3) What happens if we don’t pass explicit intervals, and instead ask pd.cut to just create 3 bins, with bins=3?

In [36]:
# Create three equal-length bins for the passenger counts
# The bins will automatically be determined based on the range of the data
binned_passengers = pd.cut(s1, bins=3)

# Count the occurrences in each bin
bin_counts = binned_passengers.value_counts()

# Display the counts of each bin
print(bin_counts)


0
(-0.006, 2.0]    8522
(4.0, 6.0]        889
(2.0, 4.0]        588
Name: count, dtype: int64
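
With an integer `bins` argument, `pd.cut` splits the range of the data into equal-width intervals, so skewed data piles up in the first bin. For contrast, `pd.qcut` chooses edges so that each bin holds roughly the same number of values. The distances below are made up to show the difference.

```python
import pandas as pd

# Hypothetical, right-skewed trip distances
distances = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 8.0, 15.0, 30.0])

# Equal-width bins: the range is divided into three intervals of the same length
equal_width = pd.cut(distances, bins=3).value_counts(sort=False)

# Equal-count bins: edges are placed at the 1/3 and 2/3 quantiles
equal_count = pd.qcut(distances, q=3).value_counts(sort=False)

print(equal_width)
print(equal_count)
```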
