#### Data analysis is commonly divided into several types, including:

#### Descriptive analysis: This involves summarizing and describing the characteristics of a dataset, such as mean, median, mode, standard deviation, and frequency distribution.

#### Inferential analysis: This involves using statistical techniques to draw conclusions about a population based on a sample of data.

#### Predictive analysis: This involves using data to make predictions about future events or trends, such as forecasting sales for the next quarter.

#### Prescriptive analysis: This involves using data to identify the best course of action or decision to take, such as optimizing a supply chain to reduce costs.

#### Diagnostic analysis: This involves identifying the cause of a particular outcome or problem, such as determining why sales have decreased in a particular region.

#### Exploratory analysis: This involves investigating a dataset to discover patterns, relationships, and insights that may not be immediately apparent, such as using clustering techniques to identify customer segments.

#### Textual analysis: This involves analyzing unstructured text data, using techniques such as natural language processing to extract meaning and insights.

# Descriptive analysis

### Descriptive analysis is a type of data analysis that involves summarizing and describing the characteristics of a dataset. This can include measures of central tendency (e.g., mean, median, mode) and measures of variability (e.g., standard deviation, range).

In [9]:
import pandas as pd

# Load the dataset into a pandas dataframe
df = pd.read_csv('example_dataset.csv')

# Print the first five rows of the dataset
print(df.head())

# Calculate the mean, median, and mode of a numerical variable in the dataset
print("Mean:", df['numerical_variable'].mean())
print("Median:", df['numerical_variable'].median())
print("Mode:", df['numerical_variable'].mode())

# Calculate the standard deviation and range of the numerical variable
print("Standard Deviation:", df['numerical_variable'].std())
print("Range:", df['numerical_variable'].max() - df['numerical_variable'].min())


Mean: 14.127291739894563
Median: 13.37
Mode: 0    12.34
Name: radius_mean, dtype: float64
Standard Deviation: 3.524048826212078
Range: 21.128999999999998


#### In this example code, we first load a dataset into a pandas dataframe using the read_csv function. We then use the head method to print the first five rows of the dataset to get a quick overview of the data.

#### Next, we calculate several measures of central tendency and variability for a numerical variable in the dataset (numerical_variable). We use the mean, median, and mode methods to calculate the mean, median, and mode of the variable, respectively. Note that mode returns a pandas Series rather than a single number, because a variable can have more than one mode; this is why the output above shows an index (0) and a column name (radius_mean in the run shown) next to the value. We then use the std method to calculate the sample standard deviation of the variable, and the max and min methods to calculate its range. These measures summarize where the values are centered and how widely they are spread.
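
#### As a self-contained illustration of the same measures, the sketch below uses Python's standard-library statistics module on a small made-up list (the values are hypothetical and are not taken from the dataset above). Like the pandas std method, statistics.stdev computes the sample standard deviation (dividing by n - 1), and multimode returns a list because a dataset can have more than one mode.

```python
import statistics

# Hypothetical sample values, chosen only for illustration
values = [2, 4, 4, 5, 7, 9]

mean = statistics.mean(values)
median = statistics.median(values)
modes = statistics.multimode(values)   # list of all most-frequent values
std_dev = statistics.stdev(values)     # sample standard deviation (n - 1)
value_range = max(values) - min(values)

print("Mean:", round(mean, 3))
print("Median:", median)
print("Mode(s):", modes)
print("Standard Deviation:", round(std_dev, 3))
print("Range:", value_range)
```

#### Here the mean (about 5.17) is larger than the median (4.5), which hints that the larger values pull the distribution to the right.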

# Inferential analysis

### Inferential analysis is a type of statistical analysis that involves drawing conclusions about a population based on a sample of data. It relies on techniques such as confidence intervals and hypothesis testing to quantify how far the characteristics of a sample can be generalized to the population.

In [13]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Load the dataset into a pandas dataframe
df = pd.read_csv('example_dataset.csv')

# Calculate the mean of a numerical variable in the dataset
sample_mean = df['numerical_variable'].mean()

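# Calculate the standard error of the mean
# (count() ignores missing values, matching the rows that mean() and std() use)
standard_error = df['numerical_variable'].std() / np.sqrt(df['numerical_variable'].count())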

# Calculate a 95% confidence interval for the population mean
lower_bound = sample_mean - 1.96 * standard_error
upper_bound = sample_mean + 1.96 * standard_error
print("95% Confidence Interval: [{}, {}]".format(lower_bound, upper_bound))

# Test whether the mean of a numerical variable is significantly different between two groups
# (missing values are dropped first, because ttest_ind returns nan if a group contains them)
group1 = df[df['group'] == 'A']['numerical_variable'].dropna()
group2 = df[df['group'] == 'B']['numerical_variable'].dropna()
t_stat, p_val = ttest_ind(group1, group2, equal_var=False)
print("t-statistic: ", t_stat)
print("p-value: ", p_val)


95% Confidence Interval: [13.837729548191374, 14.416853931597752]
t-statistic:  nan
p-value:  nan


#### In this example code, we first load a dataset into a pandas dataframe using the read_csv function. We then calculate the sample mean and standard error of a numerical variable (numerical_variable) in the dataset.

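#### Next, we use these values to calculate a 95% confidence interval for the population mean of the variable. The multiplier 1.96 is the critical value of the standard normal distribution for a two-sided 95% interval, so the calculation assumes a sample large enough for the normal approximation to hold. The interval gives a range of plausible values for the true population mean: if the sampling were repeated many times, about 95% of the intervals built this way would contain it.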
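
#### The sketch below reproduces the confidence interval calculation on a small made-up sample, using only the standard library, and shows where the 1.96 multiplier comes from. The sample values are hypothetical. With only ten observations a t-based critical value would give a slightly wider interval; the normal value is used here simply to mirror the calculation described above.

```python
import math
import statistics

# Hypothetical sample values, chosen only for illustration
sample = [12.1, 13.4, 14.8, 13.9, 15.2, 12.7, 14.1, 13.6, 14.4, 13.0]

n = len(sample)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Two-sided 95% critical value of the standard normal distribution
z = statistics.NormalDist().inv_cdf(0.975)

lower_bound = sample_mean - z * standard_error
upper_bound = sample_mean + z * standard_error

print(f"z = {z:.4f}")
print(f"95% Confidence Interval: [{lower_bound:.3f}, {upper_bound:.3f}]")
```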

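#### Finally, we test whether the mean of the numerical variable is significantly different between two groups (group1 and group2) using a t-test. The ttest_ind function from the scipy.stats library calculates the t-statistic and p-value for this test; passing equal_var=False selects Welch's version of the test, which does not assume that the two groups have equal variances. The p-value indicates the probability of observing a difference in means as large as or larger than the one observed, assuming that the null hypothesis (that the means are equal) is true. If the p-value is below a predetermined significance level (e.g., 0.05), we reject the null hypothesis and conclude that the means are significantly different. The nan values in the output above mean that the test could not be computed in this run, which typically happens when one of the groups is empty (for example, if the group column does not contain the labels 'A' and 'B') or contains missing values.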
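
#### To make the test statistic less of a black box, here is a minimal sketch of the quantity that Welch's t-test is built on, written with the standard library only. The function name welch_t and the two groups are hypothetical. The p-value is left out because it requires the Student's t distribution, which the standard library does not provide. The explicit size check guards against groups too small to have a sample variance, a case that would otherwise fail or give no meaningful result.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Return Welch's t-statistic and its approximate degrees of freedom."""
    if len(group_a) < 2 or len(group_b) < 2:
        raise ValueError("each group needs at least two observations")
    # Squared standard error contributed by each group
    se2_a = statistics.variance(group_a) / len(group_a)
    se2_b = statistics.variance(group_b) / len(group_b)
    diff = statistics.mean(group_a) - statistics.mean(group_b)
    t_stat = diff / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(group_a) - 1) + se2_b ** 2 / (len(group_b) - 1)
    )
    return t_stat, dof

# Hypothetical measurements for two groups
group_a = [5.1, 4.9, 5.6, 5.8, 5.2]
group_b = [6.3, 6.1, 6.8, 5.9, 6.5]

t_stat, dof = welch_t(group_a, group_b)
print("t-statistic:", round(t_stat, 3))
print("degrees of freedom:", round(dof, 2))
```

#### A t-statistic far from zero, as here, means the difference between the group means is large relative to the variability within the groups.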

# Predictive analysis

### Predictive analysis is a type of data analysis that involves using data to make predictions about future events or trends. Predictive analysis uses techniques such as regression analysis, time series analysis, and machine learning to develop models that can forecast future outcomes based on historical data.

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the dataset into a pandas dataframe
df = pd.read_csv('df.csv')

# Split the dataset into training and testing sets
# (random_state fixes the shuffle so that the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    df[['feature1', 'feature2']], df['target_variable'], test_size=0.2, random_state=42)

# Train a linear regression model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Use the trained model to make predictions on the testing set
y_pred = model.predict(X_test)

# Evaluate the performance of the model using mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error: ", mse)

#### In this example code, we first load a dataset into a pandas dataframe using the read_csv function. We then split the dataset into training and testing sets using the train_test_split function from scikit-learn.

#### Next, we train a linear regression model on the training set using the LinearRegression class from scikit-learn. We then use the trained model to make predictions on the testing set using the predict method.

#### Finally, we evaluate the performance of the model on the testing set using mean squared error (MSE). The mean_squared_error function from scikit-learn calculates the MSE between the predicted and actual values of the target variable. The MSE is a measure of the average squared difference between the predicted and actual values, and a lower MSE indicates better predictive performance.
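
#### The MSE formula itself is short enough to compute by hand. The sketch below does so in plain Python for a few made-up actual and predicted values; the function name mse_by_hand and the numbers are hypothetical. It also prints the root mean squared error (RMSE), which is often easier to interpret because it is expressed in the same units as the target variable.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 8.0]

mse = mse_by_hand(y_true, y_pred)
rmse = math.sqrt(mse)

print("Mean Squared Error:", round(mse, 4))
print("Root Mean Squared Error:", round(rmse, 4))
```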

# Prescriptive analysis

### Prescriptive analysis is a type of data analysis that involves using data and models to make recommendations for actions that will optimize a particular outcome. Prescriptive analysis builds on predictive analysis by incorporating constraints and objectives to identify the best course of action given the available options.

In [None]:
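from pulp import LpMaximize, LpProblem, LpVariable, value

# Create a linear optimization problem
# (PuLP does not permit spaces in problem names, so an underscore is used)
prob = LpProblem("Example_Problem", LpMaximize)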

# Define the decision variables
x = LpVariable("x", 0, None)
y = LpVariable("y", 0, None)

# Define the objective function to be maximized
prob += 2*x + 3*y

# Add constraints
prob += x + y <= 4
prob += x <= 2
prob += y <= 3

# Solve the problem
prob.solve()

# Print the optimal values of the decision variables and the objective function
print("x: ", x.value())
print("y: ", y.value())
print("Optimal Objective Function Value: ", value(prob.objective))

#### In this example code, we first create a linear optimization problem using the LpProblem class from the PuLP library. We then define two decision variables, x and y, and an objective function to be maximized (in this case, 2*x + 3*y).

#### We then add three constraints to the problem using the += operator. These constraints limit the values of x and y and ensure that their sum is no greater than 4.

#### Finally, we solve the optimization problem using the solve method, and print the optimal values of the decision variables (via their value method) and of the objective function (via PuLP's value function). In this example, the optimal values are x=1 and y=3, and the optimal objective function value is 11. Because y has the larger coefficient in the objective, the solver pushes y to its upper limit of 3 and then gives x the remaining capacity allowed by x + y <= 4. This indicates that the best course of action is to set x=1 and y=3 in order to maximize the objective function, subject to the given constraints.
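
#### For a problem this small, the solver's answer can be checked by hand, because the optimum of a linear program with a bounded feasible region is attained at a corner of that region. The sketch below (plain Python, not part of PuLP) intersects every pair of constraint boundaries from the problem above, keeps the feasible corner points, and picks the one with the largest objective value.

```python
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c.
# The last two rows encode the bounds x >= 0 and y >= 0.
constraints = [
    (1, 1, 4),    # x + y <= 4
    (1, 0, 2),    # x <= 2
    (0, 1, 3),    # y <= 3
    (-1, 0, 0),   # x >= 0
    (0, -1, 0),   # y >= 0
]

def objective(x, y):
    return 2 * x + 3 * y

def is_feasible(x, y, tol=1e-9):
    return all(a * x + b * y <= c + tol for a, b, c in constraints)

vertices = []
for (a1, b1, c1), (a2, b2, c2) in combinations(constraints, 2):
    det = a1 * b2 - a2 * b1
    if det == 0:
        continue  # parallel boundaries never intersect
    # Solve the 2x2 system with Cramer's rule
    x = (c1 * b2 - c2 * b1) / det
    y = (a1 * c2 - a2 * c1) / det
    if is_feasible(x, y):
        vertices.append((x, y))

best_x, best_y = max(vertices, key=lambda point: objective(*point))
best_value = objective(best_x, best_y)
print("x:", best_x, "y:", best_y, "objective:", best_value)
```

#### The best corner is x=1, y=3 with an objective value of 11; the corner x=2, y=2 is feasible but only reaches 10.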

# Diagnostic analysis

### Diagnostic analysis is a type of data analysis that involves identifying the cause of a problem or issue by examining data and other information. Diagnostic analysis often involves visualizations and statistical techniques to identify patterns and trends in the data that can explain the issue.

In [None]:
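import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset into a pandas dataframe
df = pd.read_csv('example_dataset.csv')

# Compute and print summary statistics of the dataset
summary = df.describe()
print(summary)

# Create a pairplot of the dataset
sns.pairplot(df)

# Create a heatmap of the correlation matrix on its own figure
# (numeric_only=True skips non-numeric columns, which cannot be correlated)
corr_matrix = df.corr(numeric_only=True)
plt.figure()
sns.heatmap(corr_matrix, cmap='coolwarm', annot=True)
plt.show()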


#### In this example code, we first load a dataset into a pandas dataframe using the read_csv function. We then compute summary statistics of the dataset using the describe method, which by default reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each numerical column in the dataframe.

#### We then create a pairplot of the dataset using the pairplot function from the seaborn library. A pairplot is a scatterplot matrix that shows the relationship between every pair of variables in the dataset, and can be used to identify patterns and trends in the data.

#### Finally, we create a heatmap of the correlation matrix using the heatmap function from seaborn. The correlation matrix shows the correlation coefficient between every pair of numerical variables in the dataset, and can be used to identify which variables are most strongly related to each other. In this example, we use a colormap called 'coolwarm' to indicate the strength and direction of the correlation coefficient, and we annotate each cell with the actual value of the correlation coefficient. Correlation alone does not establish cause, but strongly correlated variables are good candidates to investigate further as possible explanations of the issue or problem under investigation.
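
#### Each cell of the correlation matrix is, by default in pandas, a Pearson correlation coefficient. The sketch below computes that coefficient by hand in plain Python so that the numbers in the heatmap are easier to interpret. The function name pearson_r and the three small series (ad_spend, price, sales) are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures
ad_spend = [10, 20, 30, 40, 50]
price = [9, 8, 7, 6, 5]
sales = [12, 24, 33, 46, 55]

r_ads = pearson_r(ad_spend, sales)
r_price = pearson_r(price, sales)

print("ad_spend vs sales:", round(r_ads, 3))
print("price vs sales:", round(r_price, 3))
```

#### A value close to +1 means the two variables rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.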

# Exploratory analysis

### Exploratory analysis is a type of data analysis that involves exploring and summarizing a dataset to gain insights and generate hypotheses about the relationships and patterns in the data. Exploratory analysis often involves data visualization and summary statistics.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset into a pandas dataframe
df = pd.read_csv('example_dataset.csv')

# Print the first five rows of the dataset
print(df.head())

# Compute summary statistics of the dataset
summary = df.describe()
print(summary)

# Create a histogram of one variable in the dataset
plt.hist(df['variable'], bins=20)
plt.xlabel('Variable Name')
plt.ylabel('Frequency')
plt.show()

# Create a scatterplot of two variables in the dataset
plt.scatter(df['variable1'], df['variable2'])
plt.xlabel('Variable 1')
plt.ylabel('Variable 2')
plt.show()


#### In this example code, we first load a dataset into a pandas dataframe using the read_csv function. We then print the first five rows of the dataset using the head method to get an idea of what the data looks like.

#### We then compute summary statistics of the dataset using the describe method, which by default reports the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of each numerical column in the dataframe.

#### We then create a histogram of one variable in the dataset using the hist function from matplotlib. The histogram shows the distribution of values of that variable in the dataset, and can be used to identify any patterns or outliers.

#### Finally, we create a scatterplot of two variables in the dataset using the scatter function from matplotlib. The scatterplot shows the relationship between the two variables, and can be used to identify any trends or patterns in the data. In this example, we plot variable1 on the x-axis and variable2 on the y-axis.
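
#### To show what a histogram actually counts, the sketch below bins a small made-up list into equal-width intervals in plain Python. The function name histogram_counts and the values are hypothetical. As in numpy.histogram, which plt.hist uses internally, each bin includes its left edge and the last bin also includes the maximum value.

```python
def histogram_counts(values, bins):
    """Count values in equal-width bins spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("values must not all be identical")
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

# Hypothetical values with one unusually large observation
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]

counts, edges = histogram_counts(values, bins=4)
for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {'#' * count}")
```

#### Most values fall in the first two bins, while the single value 9 sits alone in the last bin, which is how an outlier typically shows up in a histogram.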

# Textual analysis

### Textual analysis is a type of data analysis that involves analyzing written or spoken language to extract meaningful insights and patterns. This type of analysis is often used in fields such as linguistics, marketing, and social sciences to analyze written text or transcripts of conversations.

In [None]:
import nltk
from nltk.corpus import movie_reviews

# Load the movie reviews dataset
nltk.download('movie_reviews')
reviews = [(list(movie_reviews.words(fileid)), category)
           for category in movie_reviews.categories()
           for fileid in movie_reviews.fileids(category)]

# Print the first review in the dataset
print(reviews[0])

# Convert the reviews to lowercase and keep only alphabetic tokens
# (isalpha() drops punctuation as well as numbers)
reviews_cleaned = []
for words, category in reviews:
    words_cleaned = [word.lower() for word in words if word.isalpha()]
    reviews_cleaned.append((words_cleaned, category))

# Compute the frequency distribution of words in positive reviews
positive_reviews = [words for words, category in reviews_cleaned if category == 'pos']
positive_words = [word for review in positive_reviews for word in review]
positive_word_freq = nltk.FreqDist(positive_words)
print(positive_word_freq.most_common(10))

# Compute the frequency distribution of words in negative reviews
negative_reviews = [words for words, category in reviews_cleaned if category == 'neg']
negative_words = [word for review in negative_reviews for word in review]
negative_word_freq = nltk.FreqDist(negative_words)
print(negative_word_freq.most_common(10))


#### In this example code, we first load the movie reviews dataset from the NLTK corpus using the movie_reviews module. The dataset consists of 1000 positive and 1000 negative reviews of movies.

#### We then print the first review in the dataset to get an idea of what the data looks like.

#### Next, we clean the reviews using a for loop: each word is converted to lowercase, and tokens that are not purely alphabetic (punctuation and numbers) are dropped. We store the cleaned reviews in a new list called reviews_cleaned.

#### Finally, we compute the frequency distribution of words in positive and negative reviews using the FreqDist class from NLTK. We first extract all the words from the positive and negative reviews into separate lists called positive_words and negative_words. We then use FreqDist to count the frequency of each word in each list, and print the 10 most common words in each list. Because common function words such as "the" and "and" are not removed, they tend to dominate both lists; filtering out such stop words first makes it easier to see which words are most frequently associated with positive and negative reviews.
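
#### The effect of removing very common words can be seen in the small sketch below, which counts words with collections.Counter from the standard library instead of NLTK. The stop-word set, the three short reviews, and the function name top_words are all hypothetical and serve only as an illustration.

```python
from collections import Counter

# Hypothetical stop-word list and reviews, for illustration only
STOP_WORDS = {"the", "a", "and", "was", "is", "it", "of", "this"}

reviews = [
    ("The plot was gripping and the acting was superb", "pos"),
    ("A superb cast and a gripping story", "pos"),
    ("The plot was dull and the pacing was slow", "neg"),
]

def top_words(reviews, category, n=3, stop_words=STOP_WORDS):
    """Most common non-stop-words in the reviews of one category."""
    counts = Counter()
    for text, label in reviews:
        if label != category:
            continue
        words = [w.lower() for w in text.split() if w.isalpha()]
        counts.update(w for w in words if w not in stop_words)
    return counts.most_common(n)

print("With stop words removed:", top_words(reviews, "pos", n=2))
print("Without filtering:", top_words(reviews, "pos", n=2, stop_words=set()))
```

#### With filtering, the top positive words are "gripping" and "superb"; without it, the list is led by words such as "the" that say nothing about sentiment.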