In [None]:
Q1. Load the flight price dataset and examine its dimensions. How many rows and columns does the
dataset have?

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('flight_price_dataset.csv')

# Check dimensions
dimensions = df.shape
print(f'The dataset has {dimensions[0]} rows and {dimensions[1]} columns.')


In [None]:
Q2. What is the distribution of flight prices in the dataset? Create a histogram to visualize the
distribution.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
df = pd.read_csv('flight_price_dataset.csv')


# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the histogram
plt.figure(figsize=(10, 6))
sns.histplot(df['price'], bins=30, kde=True)  # Adjust bins as necessary
plt.title('Distribution of Flight Prices')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.xlim(0, df['price'].max())  # Limit x-axis to a sensible range
plt.grid(axis='y', alpha=0.75)
plt.show()


In [None]:
Q3. What is the range of prices in the dataset? What is the minimum and maximum price?

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('flight_price_dataset.csv')


# Assuming the price column is named 'price'
min_price = df['price'].min()
max_price = df['price'].max()

# Calculate the range
price_range = max_price - min_price

print(f'Minimum Price: {min_price}')
print(f'Maximum Price: {max_price}')
print(f'Price Range: {price_range}')

In [None]:
Q4. How does the price of flights vary by airline? Create a boxplot to compare the prices of different
airlines.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
df = pd.read_csv('flight_price_dataset.csv')

# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(x='airline', y='price', data=df)
plt.title('Flight Prices by Airline')
plt.xlabel('Airline')
plt.ylabel('Price')
plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.grid(axis='y', alpha=0.75)
plt.show()


In [None]:
Q5. Are there any outliers in the dataset? Identify any potential outliers using a boxplot and describe how
they may impact your analysis.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
df = pd.read_csv('flight_price_dataset.csv')

# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the boxplot
plt.figure(figsize=(12, 6))
sns.boxplot(x='airline', y='price', data=df)
plt.title('Flight Prices by Airline with Outlier Detection')
plt.xlabel('Airline')
plt.ylabel('Price')
plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.grid(axis='y', alpha=0.75)
plt.show()

In [None]:
### Impact of Outliers on Analysis
1. Influence on Statistical Measures:
Outliers can skew mean values, making them unrepresentative of the general data. Medians are less affected and may
provide a more reliable measure of central tendency.
2. Modeling Challenges:
In regression models, outliers can disproportionately influence the fit, leading to poor predictive performance.
They can also affect assumptions of normality and homoscedasticity.
3. Data Interpretation:
Outliers may indicate anomalies, such as pricing errors or unique pricing strategies. Understanding the context of 
these outliers is essential for making informed decisions.
4. Data Cleaning Decisions:
Depending on the analysis goals, you might choose to remove outliers, adjust them, or keep them for further 
investigation.
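
In [None]:
The boxplot marks points beyond 1.5 times the interquartile range (seaborn's default whisker length) as outliers. The sketch below applies the same rule numerically so the outliers can be counted and inspected. The helper name `flag_iqr_outliers` and the toy prices are illustrative assumptions, not part of the flight price dataset.

```python
import pandas as pd


def flag_iqr_outliers(prices: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the boxplot whiskers."""
    q1 = prices.quantile(0.25)
    q3 = prices.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (prices < lower) | (prices > upper)


# Toy values for illustration only; replace with df['price'] from the real dataset
toy = pd.DataFrame({'price': [100, 110, 120, 130, 140, 150, 1000]})
toy['is_outlier'] = flag_iqr_outliers(toy['price'])
outlier_count = int(toy['is_outlier'].sum())

print(toy[toy['is_outlier']])
# Comparing mean and median shows how far a single extreme value pulls the mean
print(f"Mean: {toy['price'].mean():.1f}, Median: {toy['price'].median():.1f}")
```

Because the boxplot above is drawn per airline, the same helper could be applied within each airline group if per-carrier outliers are of interest.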

In [None]:
Q6. You are working for a travel agency, and your boss has asked you to analyze the Flight Price dataset
to identify the peak travel season. What features would you analyze to identify the peak season, and how
would you present your findings to your boss?

In [None]:
### Key Features to Analyze
1. Travel Date:
Departure Date: This is crucial for identifying trends over time, including seasonal patterns.
Return Date: If available, it may help in understanding travel trends.

2. Day of the Week:
Analyze whether certain days (e.g., weekends vs. weekdays) have different pricing patterns.

3. Month:
Monthly aggregation can reveal trends that indicate peak travel months.

4. Seasonal Indicators:
Consider holidays, school vacations, and other events that typically drive travel demand.

5. Price Variations:
Analyzing how prices fluctuate over different periods can indicate demand peaks.

6. Flight Capacity or Availability (if available):
Data on the number of flights or available seats can help correlate demand with prices.

In [None]:
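# Assumes the dataset has a 'departure_date' column; adjust the name to match your file
df['departure_date'] = pd.to_datetime(df['departure_date'])
df['month'] = df['departure_date'].dt.month
df['day_of_week'] = df['departure_date'].dt.day_name()


monthly_avg_price = df.groupby('month')['price'].mean().reset_index()

# Keep weekdays in calendar order rather than the alphabetical order groupby produces
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_avg_price = (
    df.groupby('day_of_week')['price'].mean()
    .reindex(weekday_order)
    .rename_axis('day_of_week')
    .reset_index()
)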


import matplotlib.pyplot as plt
import seaborn as sns

# Monthly average price plot
plt.figure(figsize=(12, 6))
sns.barplot(x='month', y='price', data=monthly_avg_price)
plt.title('Average Flight Prices by Month')
plt.xlabel('Month')
plt.ylabel('Average Price')
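# Label each bar with the month it represents, even if some months are absent from the data
month_labels = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
plt.xticks(ticks=range(len(monthly_avg_price)),
           labels=[month_labels[int(m) - 1] for m in monthly_avg_price['month']])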
plt.show()

# Day of the week average price plot
plt.figure(figsize=(12, 6))
sns.barplot(x='day_of_week', y='price', data=weekday_avg_price)
plt.title('Average Flight Prices by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Price')
plt.xticks(rotation=45)
plt.show()

In [None]:
### Presenting Findings
1. Create a Report:
        Summarize your findings in a clear report. Include the following sections:
            Introduction: Purpose of the analysis and the importance of identifying peak travel seasons.
            Methodology: Describe the features analyzed and the approach used.
            Findings: Present visualizations and key insights, such as:
                Which months have the highest average prices (indicating peak seasons).
                Which days of the week tend to have higher prices (e.g., higher prices on Fridays and Sundays).
            Recommendations: Based on the analysis, suggest potential strategies for the agency, such as promotional 
                offers during low seasons or targeting marketing efforts during peak times.
2. Use Visual Aids:
        Incorporate the visualizations directly into your presentation or report to make it easier for your boss to 
        grasp the insights quickly.

3. Actionable Insights:
        Highlight actionable insights, such as adjusting pricing strategies, optimizing marketing efforts around peak 
        seasons, or advising customers on the best times to book.
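
In [None]:
For the Findings section of such a report, a compact ranking table can sit alongside the charts. The sketch below ranks months by average price and also reports how many flights each average rests on, since a month with very few flights is a weak basis for calling it peak season. The function name `rank_months_by_price`, the column names, and the toy rows are assumptions for illustration.

```python
import pandas as pd


def rank_months_by_price(flights: pd.DataFrame) -> pd.DataFrame:
    """Average price and number of flights per departure month, highest average first."""
    dates = pd.to_datetime(flights['departure_date'])
    return (
        flights.assign(month=dates.dt.month)
        .groupby('month')['price']
        .agg(avg_price='mean', n_flights='count')
        .sort_values('avg_price', ascending=False)
        .reset_index()
    )


# Toy rows for illustration only; not taken from the flight price dataset
toy = pd.DataFrame({
    'departure_date': ['2023-01-10', '2023-01-20', '2023-07-04', '2023-07-15', '2023-12-24'],
    'price': [200, 220, 400, 420, 500],
})
month_ranking = rank_months_by_price(toy)
print(month_ranking)
```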

In [None]:
Q7. You are a data analyst for a flight booking website, and you have been asked to analyze the Flight
Price dataset to identify any trends in flight prices. What features would you analyze to identify these
trends, and what visualizations would you use to present your findings to your team?

In [None]:
### Key Features to Analyze

1. Departure Date:
        Analyze how flight prices change over time, particularly across different months and seasons.

2. Day of the Week:
        Investigate if there are significant price differences based on the day flights are booked or traveled.

3. Airline:
        Compare prices across different airlines to see if there are pricing trends related to specific carriers.

4. Flight Duration:
        Explore how the duration of the flight correlates with pricing.

5. Class of Service:
        Analyze how prices differ by class (economy, business, first class).

6. Distance:
        Examine how flight distance affects pricing trends.

7. Booking Time in Advance:
        Look at how prices vary depending on how far in advance the flights are booked.
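
In [None]:
Booking time in advance is usually not stored directly and has to be derived. A minimal sketch, assuming the data has both a `booking_date` and a `departure_date` column (the `booking_date` name is hypothetical), computes the lead time in days and averages prices within lead-time buckets. The bucket edges and toy rows are illustrative choices.

```python
import pandas as pd

# Toy rows for illustration only; 'booking_date' is an assumed column name
toy = pd.DataFrame({
    'booking_date': ['2023-06-01', '2023-06-20', '2023-05-01', '2023-03-01'],
    'departure_date': ['2023-06-05', '2023-07-10', '2023-07-01', '2023-07-01'],
    'price': [450, 300, 250, 260],
})
toy['booking_date'] = pd.to_datetime(toy['booking_date'])
toy['departure_date'] = pd.to_datetime(toy['departure_date'])

# Lead time in whole days between booking and departure
toy['days_in_advance'] = (toy['departure_date'] - toy['booking_date']).dt.days

# Group lead times into buckets so the price trend is easier to read
toy['lead_time_bucket'] = pd.cut(
    toy['days_in_advance'],
    bins=[0, 7, 30, 90, float('inf')],
    labels=['0-7', '8-30', '31-90', '91+'],
    include_lowest=True,
)
avg_price_by_lead_time = toy.groupby('lead_time_bucket', observed=True)['price'].mean()
print(avg_price_by_lead_time)
```

The resulting series can be passed to a bar or line plot in the same way as the monthly averages.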

In [None]:
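# Assumes the dataset has a 'departure_date' column; adjust the name to match your file
df['departure_date'] = pd.to_datetime(df['departure_date'])
df['month'] = df['departure_date'].dt.month
df['day_of_week'] = df['departure_date'].dt.day_name()


monthly_avg_price = df.groupby('month')['price'].mean().reset_index()

# Keep weekdays in calendar order rather than the alphabetical order groupby produces
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_avg_price = (
    df.groupby('day_of_week')['price'].mean()
    .reindex(weekday_order)
    .rename_axis('day_of_week')
    .reset_index()
)
airline_avg_price = df.groupby('airline')['price'].mean().reset_index()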


import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.lineplot(x='month', y='price', data=monthly_avg_price, marker='o')
plt.title('Average Flight Prices by Month')
plt.xlabel('Month')
plt.ylabel('Average Price')
# Month numbers run from 1 to 12 on a numeric axis, so the ticks must start at 1
plt.xticks(ticks=range(1, 13), labels=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.grid()
plt.show()


plt.figure(figsize=(12, 6))
sns.barplot(x='day_of_week', y='price', data=weekday_avg_price)
plt.title('Average Flight Prices by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Price')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.75)
plt.show()


plt.figure(figsize=(12, 6))
sns.boxplot(x='airline', y='price', data=df)
plt.title('Flight Prices by Airline')
plt.xlabel('Airline')
plt.ylabel('Price')
plt.xticks(rotation=45)
plt.grid(axis='y', alpha=0.75)
plt.show()


plt.figure(figsize=(12, 6))
sns.scatterplot(x='distance', y='price', data=df)
plt.title('Flight Price vs. Distance')
plt.xlabel('Distance (miles)')
plt.ylabel('Price')
plt.grid()
plt.show()

In [None]:
Q8. You are a data scientist working for an airline company, and you have been asked to analyze the
Flight Price dataset to identify the factors that affect flight prices. What features would you analyze to
identify these factors, and how would you present your findings to the management team?

In [None]:
### Key Features to Analyze
1. Departure Date:
        Analyze how flight prices change over different times of the year, including peak travel seasons.

2. Day of the Week:
        Investigate pricing variations based on whether flights are scheduled on weekends or weekdays.

3. Airline:
        Compare pricing strategies across different airlines to understand how carrier choice impacts pricing.

4. Flight Duration:
        Examine how the length of the flight correlates with price, as longer flights may have higher costs.

5. Distance:
        Analyze the relationship between the distance traveled and the ticket price.

6. Class of Service:
        Consider how pricing differs between economy, business, and first-class tickets.

7. Booking Time in Advance:
        Analyze how prices change based on how far in advance customers book their flights.

8. Number of Stops:
        Investigate whether direct flights are priced differently compared to those with one or more stops.

In [None]:
import pandas as pd

df = pd.read_csv('flight_price_dataset.csv')
df['departure_date'] = pd.to_datetime(df['departure_date'])
df['month'] = df['departure_date'].dt.month
df['day_of_week'] = df['departure_date'].dt.day_name()


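import statsmodels.formula.api as smf

# The column names in the formula are assumptions; rename them to match the dataset.
# C(month) treats month as a categorical factor instead of a linear numeric trend.
model = smf.ols('price ~ airline + C(month) + day_of_week + duration + distance + class_of_service + booking_time + num_stops', data=df).fit()
print(model.summary())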


import seaborn as sns
import matplotlib.pyplot as plt

# Boxplot for airline vs. price
plt.figure(figsize=(12, 6))
sns.boxplot(x='airline', y='price', data=df)
plt.title('Flight Prices by Airline')
plt.xticks(rotation=45)
plt.show()

# Scatter plot for distance vs. price
plt.figure(figsize=(12, 6))
sns.scatterplot(x='distance', y='price', data=df)
plt.title('Flight Price vs. Distance')
plt.xlabel('Distance (miles)')
plt.ylabel('Price')
plt.grid()
plt.show()

In [None]:
### Presenting Findings
1. Create a Report:
        Summarize your findings in a well-structured report or presentation. Key sections to include:
            Introduction: Briefly explain the purpose of the analysis and its importance for the airline.
            Methodology: Describe the features analyzed and the statistical methods used.
            Findings: Present key insights:
                Highlight which features have the strongest impact on flight prices.
                Include visualizations to illustrate your points clearly.
            Recommendations: Based on your analysis, suggest actionable strategies for pricing, marketing, and revenue 
                management.

2. Use Visual Aids:
        Incorporate visualizations directly into your presentation to make the data more accessible. Use graphs and 
        charts to reinforce your findings.

3. Actionable Insights:
        Focus on actionable insights. For example, if you find that flights booked several months in advance are 
        significantly cheaper, recommend a marketing campaign targeting early bookings.
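
In [None]:
To highlight which features have the strongest relationship with price, a simple first-pass screen is to rank numeric features by their correlation with price. The sketch below does this on toy data. The column names are assumptions carried over from the regression formula, and a correlation only describes a linear association, so it complements the model summary rather than replacing it.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the flight price dataset
toy = pd.DataFrame({
    'distance': [300, 600, 900, 1200, 1500],
    'duration': [1.0, 1.8, 2.5, 3.4, 4.0],
    'num_stops': [1, 0, 1, 0, 0],
    'price': [120, 180, 260, 330, 400],
})

numeric_features = ['distance', 'duration', 'num_stops']

# Pearson correlation of each feature with price, strongest (by absolute value) first
price_correlations = (
    toy[numeric_features]
    .corrwith(toy['price'])
    .sort_values(key=abs, ascending=False)
)
print(price_correlations)
```

A horizontal bar chart of `price_correlations` is one way to show this ranking to a management audience.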

In [None]:
Q9. Load the Google Playstore dataset and examine its dimensions. How many rows and columns does
the dataset have?

In [None]:
import pandas as pd


# Load the dataset
df = pd.read_csv('google_playstore.csv')


# Check dimensions
dimensions = df.shape
print(f'The dataset has {dimensions[0]} rows and {dimensions[1]} columns.')

In [None]:
Q10. How does the rating of apps vary by category? Create a boxplot to compare the ratings of different
app categories.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
df = pd.read_csv('google_playstore.csv')


# Convert ratings to numeric, coerce errors to handle non-numeric values
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop rows with missing ratings
df = df.dropna(subset=['Rating'])


# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the boxplot
plt.figure(figsize=(15, 8))
sns.boxplot(x='Category', y='Rating', data=df)
plt.title('App Ratings by Category')
plt.xlabel('Category')
plt.ylabel('Rating')
plt.xticks(rotation=45)  # Rotate x labels for better readability
plt.grid(axis='y', alpha=0.75)
plt.show()

In [None]:
Q11. Are there any missing values in the dataset? Identify any missing values and describe how they may
impact your analysis.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('google_playstore.csv')


# Check for missing values
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])  # Display only columns with missing values


In [None]:
### Impact of Missing Values

1. Data Integrity:
        Missing values can compromise the integrity of your analysis. For instance, if ratings are missing, you won't 
        be able to accurately assess app quality across categories.

2. Statistical Bias:
        If the missing data is not random, it can introduce bias. For example, if higher-rated apps are more likely to 
        have missing data, your analysis might underestimate app quality.

3. Model Performance:
        In predictive modeling, missing values can lead to inaccurate model predictions. Most algorithms require 
        complete datasets, and missing values can result in errors or require imputation strategies that may distort 
        the data.

4. Imputation Decisions:
        Depending on the percentage of missing values, you may choose to:
            Drop rows with missing values (if they are few).
            Impute missing values using methods like mean, median, or mode.
            Analyze missingness patterns to understand if the missing data is systematic.
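
In [None]:
The choice between dropping and imputing depends on how much is missing. A minimal sketch, using toy data, reports the percentage of missing values per column and then fills missing ratings with the median rating of the same category. Per-category median imputation is one possible strategy chosen here for illustration, not a recommendation specific to this dataset.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; not taken from the Google Playstore dataset
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D'],
    'Category': ['GAME', 'GAME', 'TOOLS', 'TOOLS'],
    'Rating': [4.5, np.nan, 3.0, np.nan],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Fill each missing rating with the median rating of its category
category_median = toy.groupby('Category')['Rating'].transform('median')
toy['Rating_imputed'] = toy['Rating'].fillna(category_median)
print(toy)
```

Keeping the imputed values in a separate column makes it easy to compare results with and without imputation.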

In [None]:
Q12. What is the relationship between the size of an app and its rating? Create a scatter plot to visualize
the relationship.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
df = pd.read_csv('google_playstore.csv')


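# Convert 'Size' to kilobytes, handling the 'M' and 'k' suffixes
def convert_size(size):
    if not isinstance(size, str):
        return None  # Missing values arrive as NaN (a float), not a string
    size = size.strip()
    try:
        if size.endswith('M'):
            return float(size[:-1]) * 1024  # Convert MB to KB
        if size.endswith('k'):
            return float(size[:-1])
    except ValueError:
        pass  # The text before the suffix was not a valid number
    return None  # Entries without a recognised size suffix

df['Size'] = df['Size'].apply(convert_size)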

# Convert 'Rating' to numeric
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop rows with missing values in 'Size' or 'Rating'
df = df.dropna(subset=['Size', 'Rating'])


# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the scatter plot
plt.figure(figsize=(12, 6))
sns.scatterplot(x='Size', y='Rating', data=df, alpha=0.6)
plt.title('Relationship Between App Size and Rating')
plt.xlabel('Size (KB)')
plt.ylabel('Rating')
plt.xscale('log')  # Optional: log scale for better visualization if sizes vary greatly
plt.grid()
plt.show()


In [None]:
Q13. How does the type of app affect its price? Create a bar chart to compare average prices by app type.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
df = pd.read_csv('google_playstore.csv')


# Convert 'Price' to numeric, removing the '$' sign and thousands separators;
# values that still fail to parse become NaN
df['Price'] = pd.to_numeric(df['Price'].replace(r'[\$,]', '', regex=True), errors='coerce')

# Drop rows with missing values in 'Price' or 'Category'
df = df.dropna(subset=['Price', 'Category'])


average_prices = df.groupby('Category')['Price'].mean().reset_index()


# Set the aesthetic style of the plots
sns.set(style="whitegrid")

# Create the bar chart
plt.figure(figsize=(15, 8))
sns.barplot(x='Price', y='Category', data=average_prices.sort_values('Price', ascending=False))
plt.title('Average Prices by App Category')
plt.xlabel('Average Price ($)')
plt.ylabel('App Category')
plt.grid(axis='x', alpha=0.75)
plt.show()
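
In [None]:
The chart above compares categories. If "type" in the question means free versus paid apps, the same comparison can be made on a type column instead. The sketch below assumes a `Type` column with values such as 'Free' and 'Paid'; that column name and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only; 'Type' is an assumed column name
toy = pd.DataFrame({
    'App': ['A', 'B', 'C', 'D', 'E'],
    'Type': ['Free', 'Free', 'Paid', 'Paid', 'Free'],
    'Price': ['0', '0', '$4.99', '$0.99', '0'],
})

# Strip the currency symbol and convert to numeric; unparseable values become NaN
toy['Price'] = pd.to_numeric(
    toy['Price'].str.replace(r'[\$,]', '', regex=True), errors='coerce'
)

# Mean, median and number of apps per type
price_by_type = toy.groupby('Type')['Price'].agg(['mean', 'median', 'count'])
print(price_by_type)
```

The `mean` column of `price_by_type` can be plotted with `sns.barplot` in the same way as the category averages.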


In [None]:
Q14. What are the top 10 most popular apps in the dataset? Create a frequency table to identify the apps
with the highest number of installs.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('google_playstore.csv')

# Clean the 'Installs' column by removing '+' and ','; non-numeric entries become NaN
df['Installs'] = pd.to_numeric(df['Installs'].replace('[+,]', '', regex=True), errors='coerce')

# Drop rows with missing values in 'Installs' or 'App'
df = df.dropna(subset=['Installs', 'App'])

top_apps = df[['App', 'Installs']].sort_values(by='Installs', ascending=False).head(10)

print(top_apps)
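
In [None]:
If the same app appears in more than one row, it can occupy several places in a simple sorted top 10. A minimal sketch, on toy data, keeps one row per app (its highest install count) before ranking. The duplicate row and the non-numeric 'Free' entry are made up here to show how such cases would be handled.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Google Playstore dataset
toy = pd.DataFrame({
    'App': ['Chat', 'Chat', 'Maps', 'Notes', 'Game'],
    'Installs': ['1,000,000+', '1,000,000+', '500,000+', '10,000+', 'Free'],
})

# Remove '+' and ',' and convert to numeric; non-numeric entries become NaN
toy['Installs'] = pd.to_numeric(
    toy['Installs'].str.replace(r'[+,]', '', regex=True), errors='coerce'
)

# One row per app, then the ten apps with the most installs
top_apps = (
    toy.dropna(subset=['Installs'])
    .groupby('App', as_index=False)['Installs'].max()
    .sort_values('Installs', ascending=False)
    .head(10)
    .reset_index(drop=True)
)
print(top_apps)
```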


In [None]:
Q15. A company wants to launch a new app on the Google Playstore and has asked you to analyze the
Google Playstore dataset to identify the most popular app categories. How would you approach this
task, and what features would you analyze to make recommendations to the company?

In [None]:
import pandas as pd

df = pd.read_csv('google_playstore.csv')
print(df.head())
print(df.info())


# Clean the 'Installs' column; non-numeric entries become NaN and are dropped below
df['Installs'] = pd.to_numeric(df['Installs'].replace('[+,]', '', regex=True), errors='coerce')

# Convert 'Rating' to numeric
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# Drop rows with missing values in essential columns
df = df.dropna(subset=['Category', 'Installs', 'Rating'])


category_analysis = df.groupby('Category').agg({
    'Installs': 'sum',
    'Rating': 'mean',
    'App': 'count'
}).reset_index().rename(columns={'App': 'App Count'})


import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot for total installs by category
plt.figure(figsize=(15, 8))
sns.barplot(x='Installs', y='Category', data=category_analysis.sort_values('Installs', ascending=False))
plt.title('Total Installs by App Category')
plt.xlabel('Total Installs')
plt.ylabel('App Category')
plt.show()

# Bar plot for average rating by category
plt.figure(figsize=(15, 8))
sns.barplot(x='Rating', y='Category', data=category_analysis.sort_values('Rating', ascending=False))
plt.title('Average Rating by App Category')
plt.xlabel('Average Rating')
plt.ylabel('App Category')
plt.show()
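
In [None]:
Total installs favour categories that simply contain many apps. A complementary view, sketched below on toy data, reports the median installs per app next to the total and the app count, which gives a rough sense of how a typical app in each category performs. The column names in the output and the toy values are illustrative assumptions.

```python
import pandas as pd

# Toy rows for illustration only; not taken from the Google Playstore dataset
toy = pd.DataFrame({
    'Category': ['GAME', 'GAME', 'GAME', 'GAME', 'TOOLS'],
    'Installs': [100_000, 50_000, 10_000, 40_000, 150_000],
})

# Total installs, number of apps and median installs per app for each category
category_summary = (
    toy.groupby('Category')['Installs']
    .agg(total_installs='sum', app_count='count', median_installs='median')
    .sort_values('median_installs', ascending=False)
)
print(category_summary)
```

In this toy example GAME leads on total installs, while TOOLS leads on installs per app with a single entry, which is why the app count is worth reporting alongside both figures.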




In [None]:
Q16. A mobile app development company wants to analyze the Google Playstore dataset to identify the
most successful app developers. What features would you analyze to make recommendations to the
company, and what data visualizations would you use to present your findings?

In [None]:
### Key Features to Analyze

1. Developer Name:
    Identify the developers of each app to aggregate their performance.

2. Rating:
    Analyze the average rating of apps developed by each developer, as higher ratings often indicate success.

3. Installs:
    Examine the total number of installs for each developer's apps to measure popularity and reach.

4. Number of Apps:
    Count how many apps each developer has on the Playstore. A higher number of apps can indicate a more established 
    developer.

5. Average Price:
    Consider the pricing strategy of developers. Analyzing the average price of paid apps can provide insights into 
    revenue potential.

6. Category:
    Identify which categories the developers are focusing on, as some categories may be more lucrative than others.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('google_playstore.csv')

# Clean the dataset; values that cannot be parsed become NaN
# (assumes the dataset has a 'Developer' column)
df['Installs'] = pd.to_numeric(df['Installs'].replace('[+,]', '', regex=True), errors='coerce')
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')
df['Price'] = pd.to_numeric(df['Price'].replace(r'[\$,]', '', regex=True), errors='coerce')
df.dropna(subset=['Developer', 'Installs', 'Rating'], inplace=True)


developer_analysis = df.groupby('Developer').agg({
    'Rating': 'mean',
    'Installs': 'sum',
    'App': 'count',
    'Price': 'mean'
}).reset_index().rename(columns={'App': 'Number of Apps'})


top_developers = developer_analysis.sort_values(by='Installs', ascending=False).head(10)


import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.barplot(x='Installs', y='Developer', data=top_developers)
plt.title('Top Developers by Total Installs')
plt.xlabel('Total Installs')
plt.ylabel('Developer')
plt.show()


plt.figure(figsize=(12, 6))
sns.barplot(x='Rating', y='Developer', data=top_developers)
plt.title('Top Developers by Average Rating')
plt.xlabel('Average Rating')
plt.ylabel('Developer')
plt.show()


plt.figure(figsize=(12, 6))
sns.scatterplot(x='Installs', y='Rating', data=developer_analysis)
plt.title('Installs vs. Rating for Developers')
plt.xlabel('Total Installs')
plt.ylabel('Average Rating')
plt.grid()
plt.show()
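
In [None]:
An average rating based on a single app is a noisy signal of developer success. One way to guard against this, sketched below on toy data, is to require a minimum number of apps before ranking developers by rating. The threshold `MIN_APPS`, the `Developer` column name, and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only; 'Developer' is an assumed column name
toy = pd.DataFrame({
    'Developer': ['Dev A', 'Dev A', 'Dev A', 'Dev B', 'Dev C', 'Dev C'],
    'Rating': [4.2, 4.4, 4.0, 5.0, 3.8, 4.0],
})

MIN_APPS = 2  # Hypothetical threshold; tune it to the size of the real dataset

# Average rating and number of apps per developer
developer_ratings = toy.groupby('Developer')['Rating'].agg(avg_rating='mean', n_apps='count')

# Keep only developers with enough apps for the average to be meaningful
reliable_developers = (
    developer_ratings[developer_ratings['n_apps'] >= MIN_APPS]
    .sort_values('avg_rating', ascending=False)
)
print(reliable_developers)
```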


In [None]:
Q17. A marketing research firm wants to analyze the Google Playstore dataset to identify the best time to
launch a new app. What features would you analyze to make recommendations to the company, and
what data visualizations would you use to present your findings?

In [None]:
### Key Features to Analyze
1. Release Date:
    Analyze the release dates of existing apps to identify any seasonal trends or patterns in app launches.

2. Category:
    Different app categories may perform better during specific times of the year (e.g., gaming apps during holidays).

3. Installs Over Time:
    Examine how the number of installs varies over time to identify peak periods for app downloads.

4. Ratings Over Time:
    Analyze how average ratings change over time after an app's launch to see if certain times yield higher initial 
    ratings.

5. User Reviews:
    Consider the volume and sentiment of user reviews to identify trends or events that might influence app success.

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('google_playstore.csv')

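# Parse the 'Last Updated' date; unparseable values become NaT
df['Last Updated'] = pd.to_datetime(df['Last Updated'], errors='coerce')

# Clean 'Installs' and 'Rating' so they aggregate as numbers rather than strings
df['Installs'] = pd.to_numeric(df['Installs'].replace('[+,]', '', regex=True), errors='coerce')
df['Rating'] = pd.to_numeric(df['Rating'], errors='coerce')

# 'Last Updated' is used here as a proxy for the release date.
# An ordered categorical keeps the months in calendar order instead of alphabetical order.
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
df['Release Month'] = pd.Categorical(df['Last Updated'].dt.month_name(), categories=month_order, ordered=True)


installs_by_month = df.groupby('Release Month', observed=True)['Installs'].sum().reset_index()
installs_by_month.sort_values('Release Month', inplace=True)


ratings_by_month = df.groupby('Release Month', observed=True)['Rating'].mean().reset_index()
ratings_by_month.sort_values('Release Month', inplace=True)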


import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 6))
sns.barplot(x='Release Month', y='Installs', data=installs_by_month)
plt.title('Total Installs by Month')
plt.xlabel('Month')
plt.ylabel('Total Installs')
plt.xticks(rotation=45)
plt.show()


plt.figure(figsize=(12, 6))
sns.barplot(x='Release Month', y='Rating', data=ratings_by_month)
plt.title('Average Rating by Month')
plt.xlabel('Month')
plt.ylabel('Average Rating')
plt.xticks(rotation=45)
plt.show()


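# Build a monthly time series of installs, using 'Last Updated' as a proxy for launch timing
installs_numeric = pd.to_numeric(
    df['Installs'].astype(str).str.replace(r'[+,]', '', regex=True), errors='coerce'
)
update_month = pd.to_datetime(df['Last Updated'], errors='coerce').dt.to_period('M')
installs_over_time = installs_numeric.groupby(update_month).sum()
installs_over_time.index = installs_over_time.index.to_timestamp()
installs_over_time = installs_over_time.rename_axis('Date').reset_index(name='Installs')

plt.figure(figsize=(12, 6))
sns.lineplot(x='Date', y='Installs', data=installs_over_time)
plt.title('Installs Over Time')
plt.xlabel('Date')
plt.ylabel('Total Installs')
plt.grid()
plt.show()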
