## Project Information

Data Analysis and Sampling on Imports-Exports Dataset | Mahi and Jagriti | 055023 and 055016 | Group Number - 29

## Description of Data

This dataset provides detailed information on international trade transactions, capturing both import and export activities. It includes comprehensive data on various aspects of trade, making it a valuable resource for business analysis, economic research, and financial modelling.

Description of Data:
This dataset contains import and export transaction records.

Data Source & Size:
Number of Observations: 15,000
Number of Variables: 16

Data Type:
Cross-sectional

Data Variable Types:
11 categorical (text) variables
3 integer variables
2 decimal variables

Variable Classification:
Index: Transaction_ID
Categorical (Nominal): Country, Product, Import_Export, Category, Port, Shipping_Method, Supplier, Customer, Payment_Terms
Non-Categorical: Quantity, Value, Date, Customs_Code, Weight, Invoice_Number

## Project Objectives

This project conducts an in-depth exploratory analysis of the Imports and Exports Dataset from Kaggle (https://www.kaggle.com/datasets/chakilamvishwas/imports-exports-15000), which contains detailed records of trade transactions. The goal is to turn these records into evidence for informed decisions: the project identifies key markets and pinpoints countries that hold significant trade opportunities, especially in high-demand product categories such as clothing, machinery, and electronics.

Market segmentation, which classifies countries by their import-export activity using the transaction data, will help highlight high-potential trade regions. Trend analysis of volume and value over time will likewise help reveal emerging markets and stable trading partners. The project also measures the profitability of product categories across countries and ranks markets by their yield potential.

These insights support strategic decisions on resource allocation, marketing, and logistics, and they facilitate risk analysis by clarifying the dynamics of international trade and the economic conditions in key markets. This, in turn, enables businesses to adapt their operations to maximize presence and performance in the most remunerative regions.

## Analysis of Data

The analysis uses a random sample of 2,001 transaction records drawn from the 15,000-row dataset, detailing Transaction ID, Country, Product, Import/Export status, Quantity, Value, and more. Key analyses reveal significant export sectors such as Clothing, Electronics, and Machinery, with prominent markets like Burkina Faso and Argentina. Variations in shipping methods and payment terms, particularly "Prepaid" and "Net 60," reflect diverse financial strategies.

The average Quantity is approximately 4,954 units (standard deviation: 2,894), while the average Value is around $5,003, with similar variability. Coefficients of variation indicate substantial transaction variability: 58.41% for Quantity, 59.18% for Weight, and 57.57% for Value.

Statistical tests, including the Shapiro-Wilk test, show non-normal distributions for Quantity, Weight, and Value (p-values < 0.05), leading to the rejection of normality assumptions. Correlation analysis reveals weak relationships among variables, suggesting independent variations. A Chi-squared test between 'Country' and 'Import/Export' yielded a statistic of 248.08 and a p-value of 0.38, indicating no significant association, suggesting that trade patterns are influenced more by broader market conditions than by specific countries.

Overall, the analysis highlights the dataset's complexity and variability, emphasizing the need for tailored strategies in inventory management and financial forecasting while considering broader market trends.



1. Descriptive Analysis

Non-Categorical Data:

Measures of Central Tendency:
Calculating the Minimum, Maximum, Mean, Median, and Mode for key variables such as Quantity and Value to identify average transaction sizes and values, guiding strategic decisions.

Measures of Dispersion:
Analyzing the Range and Standard Deviation of transaction values to understand the variability in trade activities.

Composite Measure:
The Coefficient of Variation will highlight the relative variability in transaction sizes and values across different markets.

Categorical Data:
Frequency counts for variables like Country and Product will help identify major markets and popular product categories, assisting in prioritization for trade opportunities.

2. Inferential Analysis

Non-Categorical Data:

Test of Mean:
Conducting a t-test to compare average transaction values across different groups, identifying any significant differences.

Categorical Data:

Chi-square Test:
Analyzing the relationship between 'Country' and 'Import/Export' status to determine if there is a significant association between these variables.

3. Causal Analysis

Regression Analysis:
Implementing regression techniques to explore the relationship between transaction values and various factors such as country and product type. This analysis will aid in forecasting future trends and optimizing resource allocation in high-demand markets.

4. Visualization

Non-Categorical Data:
Utilizing scatter plots and line graphs to visualize relationships and trends in transaction metrics, facilitating better communication of findings.

Categorical Data:
Employing bar charts and pie charts to effectively present categorical data, making it easier to identify key markets and product categories at a glance.
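
The regression step described in item 3 is not implemented in the cells that follow. A minimal sketch of how it could look is given here, assuming an ordinary least squares fit with one numeric predictor and one dummy-coded categorical predictor. The function name `fit_value_regression` and the toy data are hypothetical and serve only as illustration.

```python
import numpy as np
import pandas as pd

def fit_value_regression(df, target="Value", numeric=("Quantity",), categorical=("Category",)):
    """Fit OLS of `target` on numeric columns plus dummy-coded categorical columns."""
    X_num = df[list(numeric)].astype(float)
    # drop_first=True avoids perfect collinearity with the intercept column.
    X_cat = pd.get_dummies(df[list(categorical)], drop_first=True).astype(float)
    X = pd.concat([X_num, X_cat], axis=1)
    X.insert(0, "Intercept", 1.0)
    coef, *_ = np.linalg.lstsq(X.to_numpy(), df[target].to_numpy(dtype=float), rcond=None)
    return pd.Series(coef, index=X.columns)

# Toy data (hypothetical): Value = 10 + 2 * Quantity, plus 5 when Category is "B".
toy = pd.DataFrame({
    "Quantity": [1, 2, 3, 4],
    "Category": ["A", "B", "A", "B"],
    "Value": [12.0, 19.0, 16.0, 23.0],
})
coefficients = fit_value_regression(toy)
print(coefficients.round(3))
```

On the sampled data, a call such as `fit_value_regression(my_sample)` would return one coefficient per predictor. Given the near-zero correlations among the numeric variables, a low explanatory power should be expected.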

## Observations

A notable finding from the analysis is that Burkina Faso and Argentina stand out as important export markets, especially in sectors such as clothing and machinery. This underscores their significance in the broader trading environment, suggesting that businesses could uncover profitable opportunities by concentrating on these countries. The high transaction volumes and values linked to these markets indicate a strong demand for particular products, which are essential for strategic planning and resource allocation in international trade initiatives. Gaining insights into these dynamics can enable companies to refine their market entry strategies and strengthen their competitive edge in these regions.

## Managerial Insights

The analysis delivers crucial managerial insights by emphasizing the need to concentrate on high-potential markets like Burkina Faso and Argentina. Businesses should direct resources toward marketing and distribution strategies that are customized for these areas, capitalizing on their strengths in clothing and machinery exports. Additionally, gaining a thorough understanding of the diverse payment terms and shipping methods used in these markets can enhance operational efficiency and improve customer satisfaction. By keeping an eye on trends and adapting to local market conditions, companies can position themselves to take advantage of new opportunities, ensuring ongoing growth in international trade.

In [1]:
import os
import pandas as pd

In [2]:
# Load the dataset (adjust the path to the location of the CSV file).
my_df=pd.read_csv("C:\\Users\\Hp\\Downloads\\Imports_Exports_Dataset.csv")

In [5]:
#  Display the Dimensions of Test Data.
my_df.shape

(15000, 16)

In [7]:
#  Create a Unique Sample of 2001 Observations using Student Roll Number as Random State.
my_sample= my_df.sample(n=2001, random_state=55023)

In [9]:
#  Display the Dimensions of Sample Data.
my_sample.shape

(2001, 16)
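
The sampling step relies on `random_state` to make the draw repeatable. The sketch below illustrates that property on a hypothetical toy frame (the `full` DataFrame is not part of the dataset): the same seed selects the same rows, and rows are drawn without replacement by default.

```python
import pandas as pd

# Toy frame standing in for the full dataset (hypothetical data).
full = pd.DataFrame({"row": range(100)})

# The same random_state always selects the same rows.
first = full.sample(n=10, random_state=55023)
second = full.sample(n=10, random_state=55023)

same_rows = first.index.equals(second.index)
no_duplicates = first.index.is_unique  # sampling is without replacement by default
print(same_rows, no_duplicates)
```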

In [11]:
# Display Sample Data Information.
# info() must be called with parentheses; a bare `my_sample.info` only echoes
# the bound method together with a raw dump of the frame.
my_sample.info()

In [13]:
# Display the First 05 Records of the Sample Data.
my_sample.head(5)

Unnamed: 0,Transaction_ID,Country,Product,Import_Export,Quantity,Value,Date,Category,Port,Customs_Code,Weight,Shipping_Method,Supplier,Customer,Invoice_Number,Payment_Terms
7575,99298f6d-4630-493f-a215-dba810d36228,Burkina Faso,hundred,Export,1052,3084.02,02-08-2024,Clothing,Greeneland,203424,2619.91,Land,Reynolds-Ortiz,Thomas Greene,67819102,Prepaid
7898,051dcf24-03f2-4d3e-b8e6-d3bddd298a57,Isle of Man,style,Import,6333,8201.51,27-08-2022,Electronics,South Amyport,238655,1504.02,Sea,Perez-Burns,Leah Gamble,86423608,Prepaid
11534,5f0d9b14-56b3-4949-b0df-10cdfe78bd54,Argentina,subject,Export,7125,6643.86,18-06-2022,Machinery,East April,727110,742.36,Air,Lopez-Hernandez,Robert Rogers,61577279,Net 60
10117,db46dd88-18d9-429f-825b-7c85c476714f,Armenia,write,Export,9757,852.29,20-08-2022,Clothing,Gibsonmouth,526807,269.98,Land,Little-Oliver,Lacey Ford,49929404,Net 60
8867,e1c87ff9-ee9a-40c7-9b9a-3bbda7542293,Barbados,region,Export,1156,1688.27,11-01-2024,Toys,West Michael,760320,1165.13,Air,"Turner, Choi and Hodge",Michele Wright,95880720,Prepaid


In [15]:
#Display the bottom 5 rows from the dataset using the tail() function.

bottom_rows = my_df.tail(5)
print(bottom_rows)

                             Transaction_ID           Country Product  \
14995  48df15a8-0823-4964-8c16-eddf2756f382  Marshall Islands     not   
14996  31106617-94a6-4646-a001-5e7bd45abc26           Bermuda     air   
14997  ee485839-fbde-4ced-af18-d98f5e863081          Tanzania    show   
14998  5acd54aa-ec8c-4055-be8b-a447861a471c            Tuvalu      TV   
14999  5cc039d0-a052-41fd-bfbb-c9f60c4565ac   North Macedonia    year   

      Import_Export  Quantity    Value        Date   Category  \
14995        Export      2860  2055.19  09-07-2024  Furniture   
14996        Export      2443  6407.06  18-06-2024  Furniture   
14997        Export      1702  9918.29  30-04-2020       Toys   
14998        Export      8108  9288.57  29-04-2021   Clothing   
14999        Import      5635   561.33  25-12-2019   Clothing   

                     Port  Customs_Code   Weight Shipping_Method  \
14995     South Karenfort        393463  4120.35            Land   
14996         Jeffreyside        4

In [19]:
# List the Names of Variables.
my_sample.columns

Index(['Transaction_ID', 'Country', 'Product', 'Import_Export', 'Quantity',
       'Value', 'Date', 'Category', 'Port', 'Customs_Code', 'Weight',
       'Shipping_Method', 'Supplier', 'Customer', 'Invoice_Number',
       'Payment_Terms'],
      dtype='object')

In [21]:
# Identify & list the following variables:


# Index / identifiers - Transaction_ID, Customs_Code, Invoice_Number, Product, Supplier, Customer
# Categorical (nominal) - Import_Export, Category, Shipping_Method, Payment_Terms, Country, Port
# Categorical (ordinal) - none
# Non-categorical - Quantity, Value, Weight, Date
# Note: Customs_Code is a numeric code, so it is also carried in the numeric subset below.

In [23]:
# Subset the Non-Categorical Variables.
non_cat_df=my_sample[['Quantity', 'Value', 'Customs_Code','Weight']]
non_cat_df

Unnamed: 0,Quantity,Value,Customs_Code,Weight
7575,1052,3084.02,203424,2619.91
7898,6333,8201.51,238655,1504.02
11534,7125,6643.86,727110,742.36
10117,9757,852.29,526807,269.98
8867,1156,1688.27,760320,1165.13
...,...,...,...,...
2337,7580,2855.32,372601,529.55
10733,194,9484.10,277493,4992.64
9781,4398,3951.38,340273,3550.11
10969,8497,6503.68,482887,2138.66


In [25]:
# Display the Descriptive Statistics of the Non-Categorical Set.
non_cat_df.describe()

Unnamed: 0,Quantity,Value,Customs_Code,Weight
count,2001.0,2001.0,2001.0,2001.0
mean,4954.312344,5002.552574,546121.951024,2480.098651
std,2894.01631,2880.139487,261414.19462,1467.742314
min,9.0,102.12,100229.0,2.59
25%,2469.0,2429.59,321561.0,1214.83
50%,4906.0,5012.82,543855.0,2479.63
75%,7542.0,7451.05,773577.0,3732.32
max,9995.0,9996.72,999597.0,4999.52


In [27]:
#median for non-categorical
non_cat_df.median()

Quantity          4906.00
Value             5012.82
Customs_Code    543855.00
Weight            2479.63
dtype: float64

In [29]:
#mode for non-categorical variables
non_cat_df.mode()

Unnamed: 0,Quantity,Value,Customs_Code,Weight
0,3089,5790.7,321967.0,959.06
1,3664,,523474.0,2975.07
2,5061,,898007.0,3347.51
3,6462,,,4927.0
4,6495,,,
5,7951,,,
6,8679,,,
7,9646,,,


In [31]:
#Measures of Dispersion
#Range
non_cat_df.max()-non_cat_df.min()

Quantity          9986.00
Value             9894.60
Customs_Code    899368.00
Weight            4996.93
dtype: float64

These figures are ranges (maximum minus minimum), not individual transactions or averages. Quantity spans 9,986 units (from 9 to 9,995) and Value spans 9,894.60 (from 102.12 to 9,996.72), while Weight spans 4,996.93 (from 2.59 to 4,999.52). The Customs_Code range of 899,368 has no substantive interpretation, because the code is a classification identifier rather than a measured quantity. The wide ranges show that the sample covers very small through very large transactions, which matters for inventory control and for evaluating business performance.

In [34]:
#Skewness 
non_cat_df.skew()

Quantity        0.040181
Value           0.022186
Customs_Code    0.021586
Weight          0.044580
dtype: float64

The skewness values for Quantity (0.0402), Value (0.0222), Customs Code (0.0216), and Weight (0.0446) are all very close to zero, so the distributions are nearly symmetrical with only a negligible positive skew. There is no meaningful tendency for a few large values to stretch the right tail. This balance is convenient for many statistical methods, since the data are not noticeably biased toward either end of the range.

In [37]:
#Kurtosis
non_cat_df.kurt()

Quantity       -1.183991
Value          -1.221064
Customs_Code   -1.210401
Weight         -1.206317
dtype: float64

Quantity (-1.184), Value (-1.221), Customs Code (-1.210), and Weight (-1.206) all have negative excess kurtosis, indicating that their distributions are platykurtic: compared to a normal distribution, their tails are lighter and their peaks are flatter, so values are spread more evenly across the range and extreme values are less common. For reference, a uniform distribution has an excess kurtosis of -1.2, so these values are consistent with data spread roughly evenly between the minimum and maximum. Being aware of this shape helps when judging whether statistical methods that assume normality are valid for these variables.

In [39]:
#Correlation
non_cat_df.corr()

Unnamed: 0,Quantity,Value,Customs_Code,Weight
Quantity,1.0,0.003417,-0.017677,-0.030232
Value,0.003417,1.0,0.002,-0.016178
Customs_Code,-0.017677,0.002,1.0,0.016651
Weight,-0.030232,-0.016178,0.016651,1.0


The correlation matrix shows only very weak linear relationships between Quantity, Value, Customs Code, and Weight. Quantity has a slight negative correlation with Weight (-0.0302) and Customs Code (-0.0177) and a negligible positive correlation with Value (0.0034). Value is likewise almost uncorrelated with Weight (-0.0162) and Customs Code (0.0020), and Customs Code and Weight show a negligible positive correlation (0.0167). With every coefficient this close to zero, none of these variables is a useful linear predictor of another, which limits their value in predictive modeling.

In [41]:
#Composite Measure 
#Coefficient of Variation
import statistics as stats
cv = (stats.stdev(non_cat_df['Quantity']) / non_cat_df['Quantity'].mean()) * 100 
cv

58.414086732880854

For Quantity, the coefficient of variation (CV) is roughly 58.41%, meaning the standard deviation is more than half the mean. This degree of dispersion indicates that transaction sizes are highly variable, which may be the result of different product categories or order sizes. Understanding this variability is important for both efficient inventory management and financial planning.

In [43]:
cv = (stats.stdev(non_cat_df['Weight']) / non_cat_df['Weight'].mean()) * 100 
cv

59.18080371138371

Weight's coefficient of variation (CV) is roughly 59.18%, so shipment weights vary widely relative to their mean from one transaction to the next. Such variability can affect logistics planning and shipping costs, which calls for careful management of inventory and shipping strategy. Understanding this spread is essential for improving operations and allocating resources effectively.

In [45]:
cv = (stats.stdev(non_cat_df['Value']) / non_cat_df['Value'].mean()) * 100 
cv

57.57339766518332

Value's coefficient of variation (CV) is roughly 57.57%, so transaction values also vary considerably relative to their mean. A CV this high may be the result of different product prices or order quantities. Understanding this variability is important for pricing strategies and financial forecasts, since it points to disparities that may affect budgeting and revenue management.
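
The three CV figures can be cross-checked from the summary statistics alone. The sketch below is an illustration, not the notebook's own code: the helper names `coefficient_of_variation` and `cv_from_summary` are hypothetical, and the standard deviations and means are copied from the describe() output earlier in the notebook.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

def cv_from_summary(std, mean):
    """Same measure computed from already-reported summary statistics."""
    return std / mean * 100

# Standard deviation and mean copied from the describe() output.
cv_quantity = cv_from_summary(2894.01631, 4954.312344)
cv_value = cv_from_summary(2880.139487, 5002.552574)
cv_weight = cv_from_summary(1467.742314, 2480.098651)
print(round(cv_quantity, 2), round(cv_value, 2), round(cv_weight, 2))
# prints 58.41 57.57 59.18
```

With pandas, the same result for all columns at once would be `non_cat_df.std() / non_cat_df.mean() * 100`, since `std()` uses the sample (n - 1) formula by default.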

In [47]:
import numpy as np
import pandas as pd
import scipy.stats as stats

#Confidence Interval
mean_value = non_cat_df['Value'].mean()
std_dev = non_cat_df['Value'].std()
n = len(non_cat_df['Value'])

# Confidence level (95%)
confidence_level = 0.95
alpha = 1 - confidence_level

# Calculate the critical value for the t-distribution
critical_value = stats.t.ppf(1 - alpha/2, df=n-1)

In [49]:
# Calculate the margin of error
margin_of_error = critical_value * (std_dev / np.sqrt(n))

In [51]:
# Calculate confidence interval
confidence_interval = (mean_value - margin_of_error, mean_value + margin_of_error)

In [53]:
# Display results
print(f"Mean: {mean_value:.2f}")
print(f"Standard Deviation: {std_dev:.2f}")
print(f"Confidence Interval (95%): {confidence_interval}")

Mean: 5002.55
Standard Deviation: 2880.14
Confidence Interval (95%): (4876.282343498177, 5128.822803928108)


The sample mean of Value is about 5,002.55 with a standard deviation of 2,880.14, so individual transactions vary considerably around the average. The 95% confidence interval for the mean runs from 4,876.28 to 5,128.82. This means that the procedure used to build the interval captures the true population mean in 95% of repeated samples. Although individual values are widely dispersed, the interval itself is fairly narrow (about ±126, or roughly 2.5% of the mean) because the sample of 2,001 observations is large. This precision is useful for financial planning and for gauging likely revenue variation.
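
The interval can also be reproduced in a single call. The following sketch assumes SciPy's `stats.t.interval` and uses the mean, standard deviation, and sample size reported above; the wrapper name `mean_confidence_interval` is hypothetical.

```python
import math
from scipy import stats

def mean_confidence_interval(mean, std, n, confidence=0.95):
    """Two-sided t interval for a mean, built from summary statistics."""
    sem = std / math.sqrt(n)  # standard error of the mean
    return stats.t.interval(confidence, n - 1, loc=mean, scale=sem)

lower, upper = mean_confidence_interval(5002.552574, 2880.139487, 2001)
print(round(lower, 2), round(upper, 2))
```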

In [56]:
#Inferential Statistics
#Test of Mean {t}
# Paired t-test: compares Quantity and Weight recorded on the same transaction rows.
t_stats1 = stats.ttest_rel(non_cat_df['Quantity'], non_cat_df['Weight'])
print('T-Statistic, p_value and df: ', t_stats1)

significance_level = 0.05

if t_stats1.pvalue < significance_level:
    print('Significant difference between the means')
else:
    print('No significant difference between the means')

T-Statistic, p_value and df:  TtestResult(statistic=33.69937699204916, pvalue=1.4935568073031165e-197, df=2000)
Significant difference between the means


The paired t-test compares Quantity and Weight measured on the same 2,001 transactions. With a t-statistic of about 33.70 on 2,000 degrees of freedom and a p-value of 1.49e-197, far below any conventional significance level (e.g., 0.05 or 0.01), the null hypothesis of equal means is rejected. The result mainly reflects the different scales of the two variables: the mean Quantity (about 4,954) is roughly twice the mean Weight (about 2,480). Because Quantity is a count of units and Weight is a mass, the difference has limited practical meaning, and a comparison of one variable across groups (for example, Value for imports versus exports) would be more informative.
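
A test that matches the stated objective of comparing average transaction values across groups could be sketched as follows. This is an assumption about how such a comparison might be set up, not part of the original notebook: it uses Welch's two-sample t-test (`stats.ttest_ind` with `equal_var=False`), and the helper `compare_group_means` and the toy data are hypothetical.

```python
import pandas as pd
from scipy import stats

def compare_group_means(df, value_col, group_col, group_a, group_b, alpha=0.05):
    """Welch's t-test for the mean of value_col in two independent groups."""
    a = df.loc[df[group_col] == group_a, value_col]
    b = df.loc[df[group_col] == group_b, value_col]
    result = stats.ttest_ind(a, b, equal_var=False)
    return result.statistic, result.pvalue, bool(result.pvalue < alpha)

# Toy data (hypothetical) with clearly different group means.
toy = pd.DataFrame({
    "Import_Export": ["Import"] * 5 + ["Export"] * 5,
    "Value": [100, 110, 120, 130, 140, 300, 310, 320, 330, 340],
})
t_stat, p_value, significant = compare_group_means(
    toy, "Value", "Import_Export", "Import", "Export"
)
print(t_stat, p_value, significant)
```

On the sample, the equivalent call would be `compare_group_means(my_sample, "Value", "Import_Export", "Import", "Export")`.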

In [59]:
#Test of Variance {F}
var_Quantity = np.var(non_cat_df['Quantity'], ddof = 1)
var_Weight = np.var(non_cat_df['Weight'], ddof = 1)

print('Variance of Quantity : ', var_Quantity)
print('Variance of Weight : ', var_Weight)

F = var_Quantity / var_Weight

dof1 = len(non_cat_df['Quantity']) - 1
dof2 = len(non_cat_df['Weight']) - 1

p_value = 1 - stats.f.cdf(F, dof1, dof2) # cdf = cumulative distribution function

p_value_2T = np.round(p_value * 2, 18)
print("p_value for 2 tail: ", p_value_2T)

significance_level = 0.05

if (p_value_2T < significance_level):
    print("The variances are significantly different")

else:
    print("The variances are not significantly different")

Variance of Quantity :  8375330.399892553
Variance of Weight :  2154267.5011997814
p_value for 2 tail:  2.22e-16
The variances are significantly different


The sample variance is about 8,375,330.40 for Quantity and about 2,154,267.50 for Weight, giving an F ratio of roughly 3.89. The reported two-tailed p-value of 2.22e-16 is the smallest nonzero value that the doubled 1 - cdf calculation can return in double-precision arithmetic, so the true p-value is even smaller. Since it is far below typical significance thresholds (e.g., 0.05), we reject the null hypothesis of equal variances. Quantity is therefore considerably more variable than Weight in absolute terms, which is expected given its wider range, and this should be taken into account in analyses related to inventory management and operational efficiency.
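
The precision limit of `1 - cdf` can be avoided by using the survival function directly. The sketch below is a hedged alternative, not the notebook's method: it assumes SciPy's `stats.f.sf`, takes the two variances printed above as inputs, and the helper name `f_test_from_variances` is hypothetical.

```python
from scipy import stats

def f_test_from_variances(var_x, var_y, dof_x, dof_y):
    """Two-sided F-test for equal variances from sample variances."""
    f_ratio = var_x / var_y
    upper_tail = stats.f.sf(f_ratio, dof_x, dof_y)   # P(F >= f_ratio), accurate for tiny tails
    lower_tail = stats.f.cdf(f_ratio, dof_x, dof_y)  # P(F <= f_ratio)
    p_two_sided = min(1.0, 2 * min(upper_tail, lower_tail))
    return f_ratio, p_two_sided

f_ratio, p_two_sided = f_test_from_variances(
    8375330.399892553, 2154267.5011997814, 2000, 2000
)
print(f_ratio, p_two_sided)
```

Keep in mind that the F-test is sensitive to non-normality, so Levene's test is generally the safer choice for these variables.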

In [62]:
#levene Test
stat, p_value = stats.levene(non_cat_df['Quantity'],non_cat_df['Value'],non_cat_df['Weight'], center = 'median')
print("levene's test stats: ", stat)
print("p_value: ", p_value)

significance_level = 0.05

if p_value < significance_level:
    print('Significant difference between the variances')

else:
    print('No significant difference between the variances')

levene's test stats:  646.0502019268343
p_value:  7.867411546715904e-255
Significant difference between the variances


Levene's test, applied to Quantity, Value, and Weight with the median as center, gives a statistic of about 646.05 and a p-value of 7.87e-255. This extremely low p-value indicates a highly significant difference among the variances of the three variables. Since it is far below typical significance thresholds (such as 0.05), we reject the null hypothesis that the variances are equal. This confirms that the variables differ in variability, which matters when selecting statistical techniques for further analysis.

In [65]:
#Test of Proportion {z}
Pro_Weight = 0 # Sample proportion
for i in range(len(non_cat_df['Weight'])):
    n = non_cat_df['Weight'].iloc[i]
    if (5000 <= n <= 4700):
        Pro_Weight += 1
    
Size_Weight = len(non_cat_df['Weight'])

print('Sample proporation: ', Pro_Weight)
print('Sample size: ', Size_Weight)

hypothesize_proportion = 100  # Hypothesized quantity proportion

print('hypothesized proportion', hypothesize_proportion)

standard_error = (hypothesize_proportion * (1 - hypothesize_proportion) / Size_Weight) ** 0.5

z_stats = (Pro_Weight - hypothesize_proportion) / standard_error

p_value = 2 * (1 - stats.norm.cdf(abs(z_stats)))

print('Z-stats: ', z_stats)
print('p_value: ', p_value)

significance_level = 0.05

if p_value < significance_level:
    print('hypothesized proportion is significantly different from the proportion quantity')
else:
    print('hypothesized proportion is not significantly different from the proportion quantity')

Sample proporation:  0
Sample size:  2001
hypothesized proportion 100
Z-stats:  (-2.752876973105865e-15+44.95789275769185j)
p_value:  0.0
hypothesized proportion is significantly different from the proportion quantity


The output of this cell cannot be interpreted as a valid test, because the code contains three errors. First, the condition 5000 <= n <= 4700 can never be true, so the count of matching weights is 0 regardless of the data; the intended range was presumably 4700 <= n <= 5000. Second, Pro_Weight holds a count rather than a proportion; it should be divided by the sample size of 2,001. Third, the hypothesized proportion is set to 100, but a proportion must lie between 0 and 1. With a value of 100, the term p(1 - p) is negative, its square root is imaginary, and the Z-statistic comes out as the complex number (-2.75e-15+44.96j). The p-value of 0.0 is an artifact of that calculation, so no conclusion about the share of shipments in this weight range can be drawn until the test is rerun with corrected inputs.
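
A corrected version of the proportion test could look like the sketch below. It is an illustration under stated assumptions: the weight range of 4700 to 5000 is inferred from the original condition, the helper names are hypothetical, and the toy weights and the example counts (60 successes out of 100 against a hypothesized 0.5) are made up to show the mechanics.

```python
from statistics import NormalDist

def share_in_range(values, low, high):
    """Count and proportion of values with low <= value <= high."""
    hits = sum(1 for v in values if low <= v <= high)
    return hits, hits / len(values)

def one_sample_proportion_ztest(successes, n, p0):
    """Two-sided z-test of a sample proportion against p0 (0 < p0 < 1)."""
    if not 0 < p0 < 1:
        raise ValueError("p0 must lie strictly between 0 and 1")
    p_hat = successes / n
    standard_error = (p0 * (1 - p0) / n) ** 0.5
    z = (p_hat - p0) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Toy weights (hypothetical): two of the five fall in the 4700-5000 range.
weights = [4800.0, 1200.5, 4950.2, 300.0, 2500.0]
hits, share = share_in_range(weights, 4700, 5000)

# Made-up example: 60 successes in 100 trials against p0 = 0.5.
z, p_value = one_sample_proportion_ztest(60, 100, 0.5)
print(hits, share, round(z, 2), round(p_value, 4))
```

On the sample, `share_in_range(non_cat_df['Weight'], 4700, 5000)` would supply the count, and the hypothesized proportion would need to be chosen as a fraction before running the test.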

In [68]:
#Test of Normality 
#Shapiro-Wilk
def shapiro_test(sample, name):
    stat, p_value = stats.shapiro(sample)
    print("Shapiro-wilk Test for: ", name)
    print("Test stats: ", stat, " and p-value: ", p_value)

    significance_level = 0.05
    if p_value > significance_level:
        print("Fail to reject (sample is consistent with normality)")
    else:
        print("Reject (sample does not appear to be normally distributed)")


shapiro_test(non_cat_df['Quantity'], "Quantity")
print("\n")
shapiro_test(non_cat_df['Weight'], "Weight")
print("\n")
shapiro_test(non_cat_df['Value'], "Value")

Shapiro-wilk Test for:  Quantity
Test stats:  0.9556257762681055  and p-value:  3.287814857898082e-24
Reject (sample does not appear to be normally distributed)


Shapiro-wilk Test for:  Weight
Test stats:  0.9526313996106658  and p-value:  5.977908673136602e-25
Reject (sample does not appear to be normally distributed)


Shapiro-wilk Test for:  Value
Test stats:  0.9521992922760996  and p-value:  4.706011349066878e-25
Reject (sample does not appear to be normally distributed)


According to the Shapiro-Wilk test, none of the three variables (Quantity, Weight, and Value) appears to be normally distributed. The test statistic is about 0.956 for Quantity (p-value 3.29e-24), about 0.953 for Weight (p-value 5.98e-25), and about 0.952 for Value (p-value 4.71e-25). These extremely low p-values lead to rejection of the null hypothesis of normality in every case. Since the skewness of each variable is close to zero, the departure from normality comes from the flat, light-tailed shape indicated by the negative kurtosis rather than from asymmetry. This affects the choice of later statistical tests, because methods that assume normality may not be appropriate.

In [71]:
#Kolmogorov-Smirnov
statQ_W, p_valueQ_W = stats.ks_2samp(non_cat_df['Quantity'], non_cat_df['Weight'])
statW_V, p_valueW_V = stats.ks_2samp(non_cat_df['Weight'], non_cat_df['Value'])
statV_Q, p_valueV_Q = stats.ks_2samp(non_cat_df['Value'], non_cat_df['Quantity'])

print('Quantity and Weight test stats: ', statQ_W, 'p-value: ', p_valueQ_W)
print('Weight and Value test stats: ', statW_V, 'p-value: ', p_valueW_V)
print('Value and Quantity test stats: ', statV_Q, 'p-value: ', p_valueV_Q)

significance_level = 0.05
print('\n')

if p_valueQ_W > significance_level:
    print("Quantity and Weight are from same distribution (fail to reject)")
else:
    print("Quantity and Weight are from different distribution (rejected)")

print('\n')

if p_valueW_V > significance_level:
    print("Weight and Value are from same distribution (fail to reject)")
else:
    print("Weight and Value are from different distribution (rejected)")

print('\n')

if p_valueV_Q > significance_level:
    print("Value and Quantity are from same distribution (fail to reject)")
else:
    print("Value and Quantity are from different distribution (rejected)")

Quantity and Weight test stats:  0.48825587206396803 p-value:  1.1379143288855385e-216
Weight and Value test stats:  0.5012493753123438 p-value:  6.483858355470816e-229
Value and Quantity test stats:  0.020989505247376312 p-value:  0.7702933613731343


Quantity and Weight are from different distribution (rejected)


Weight and Value are from different distribution (rejected)


Value and Quantity are from same distribution (fail to reject)


With a test statistic of 0.488 and a p-value of 1.14e-216, the two-sample Kolmogorov-Smirnov test rejects the null hypothesis that Quantity and Weight come from the same distribution. The same holds for Weight and Value, with a test statistic of 0.501 and a p-value of 6.48e-229. In contrast, the Value and Quantity test produced a statistic of 0.021 and a p-value of 0.770, so there is no evidence that their distributions differ. This pattern matches the ranges of the variables: Quantity and Value both span roughly 0 to 10,000, whereas Weight spans roughly 0 to 5,000. The differences involving Weight are therefore largely a matter of scale.
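
The effect of scale on the two-sample Kolmogorov-Smirnov statistic can be illustrated with evenly spaced toy data. This sketch is an assumption-laden illustration, not a reanalysis of the sample: `wide` and `narrow` are hypothetical arrays covering 0 to 10,000 and 0 to 5,000, chosen to mimic the ranges of Quantity or Value versus Weight.

```python
import numpy as np
from scipy import stats

# Evenly spaced toy values (hypothetical), 2001 points each.
wide = np.linspace(0, 10_000, 2001)    # range similar to Quantity or Value
narrow = np.linspace(0, 5_000, 2001)   # range similar to Weight

result_scale = stats.ks_2samp(wide, narrow)
result_same = stats.ks_2samp(wide, wide)
print(result_scale.statistic, result_same.statistic)
```

A statistic close to 0.5 arises here purely because one range is half the other, which is in line with the values of 0.488 and 0.501 reported for the comparisons that involve Weight.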

In [74]:
#  Subset and display the Categorical Variables.
cat_df = my_sample[['Import_Export', 'Category', 'Shipping_Method', 'Payment_Terms', 'Country']]
cat_df

Unnamed: 0,Import_Export,Category,Shipping_Method,Payment_Terms,Country
7575,Export,Clothing,Land,Prepaid,Burkina Faso
7898,Import,Electronics,Sea,Prepaid,Isle of Man
11534,Export,Machinery,Air,Net 60,Argentina
10117,Export,Clothing,Land,Net 60,Armenia
8867,Export,Toys,Air,Prepaid,Barbados
...,...,...,...,...,...
2337,Export,Machinery,Land,Prepaid,Dominica
10733,Export,Furniture,Air,Cash on Delivery,Comoros
9781,Export,Clothing,Sea,Prepaid,Mongolia
10969,Export,Furniture,Air,Net 60,Philippines


For the 2,001 sampled transactions, the categorical subset holds Import/Export status, Category, Shipping Method, Payment Terms, and Country. The descriptive summary of this subset shows that imports slightly outnumber exports (1,028 imports, or about 51%), and that Machinery is the most frequent of the five product categories. Exactly three shipping methods occur (Land, Air, and Sea), and the choice among them can affect delivery costs and timelines. Four payment terms occur, including Prepaid, Net 60, and Cash on Delivery, suggesting flexibility in financial arrangements. The 243 distinct countries, from the Isle of Man to Burkina Faso, reveal a varied international trading environment. This data structure supports a range of analyses, including trends in export categories, the effect of shipping methods on delivery efficiency, and country-specific preferences for payment terms.

In [77]:
#Descriptive Data for Categorical Variables
cat_df.describe()

Unnamed: 0,Import_Export,Category,Shipping_Method,Payment_Terms,Country
count,2001,2001,2001,2001,2001
unique,2,5,3,4,243
top,Import,Machinery,Land,Net 60,Uzbekistan
freq,1028,428,707,513,18


In [79]:
#Proportion
# Calculate total
total = len(cat_df['Shipping_Method'])
total

2001
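
The cell above only computes the total. A minimal sketch of the proportion step itself is shown below, assuming pandas' `value_counts(normalize=True)`; the helper `category_shares` and the toy series are hypothetical, while the Land count of 707 out of 2,001 is taken from the describe() output above.

```python
import pandas as pd

def category_shares(series):
    """Percentage share of each category, largest first."""
    return series.value_counts(normalize=True).mul(100).round(2)

# Toy series (hypothetical) to show the shape of the result.
toy_methods = pd.Series(["Land", "Land", "Air", "Sea"], name="Shipping_Method")
shares = category_shares(toy_methods)
print(shares)

# Share of the most frequent shipping method, from the describe() output.
land_share = round(707 / 2001 * 100, 2)
print(land_share)
```

On the sample, `category_shares(cat_df['Shipping_Method'])` would give the share of each shipping method.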

In [81]:
#Minimum
cat_df.min()

Import_Export                Export
Category                   Clothing
Shipping_Method                 Air
Payment_Terms      Cash on Delivery
Country                 Afghanistan
dtype: object

This output is not a single transaction. For text columns, min() returns the alphabetically first value of each column independently: "Export" for Import_Export, "Clothing" for Category, "Air" for Shipping_Method, "Cash on Delivery" for Payment_Terms, and "Afghanistan" for Country. The combination therefore need not correspond to any row in the sample, and alphabetical order carries no analytical meaning for nominal variables. The result is useful only as a quick check of the labels present in each column.
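
The column-wise behaviour of `min()` can be seen on a two-row toy frame. The data below are hypothetical and chosen so that the per-column minima form a combination that appears in no actual row.

```python
import pandas as pd

# Toy frame (hypothetical): two real transactions.
toy = pd.DataFrame({
    "Import_Export": ["Import", "Export"],
    "Category": ["Clothing", "Toys"],
})

# min() works column by column, picking the alphabetically first label.
column_minima = toy.min()

# Check whether any row equals that combination of minima.
is_real_row = bool((toy == column_minima).all(axis=1).any())
print(column_minima.to_dict(), is_real_row)
```

The same reasoning applies to `max()` and `mode()`, which are also evaluated per column.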

In [84]:
#Maximum
cat_df.max()

Import_Export        Import
Category               Toys
Shipping_Method         Sea
Payment_Terms       Prepaid
Country            Zimbabwe
dtype: object

As with the minimum, this output is not a single transaction. For text columns, max() returns the alphabetically last value of each column independently: "Import" for Import_Export, "Toys" for Category, "Sea" for Shipping_Method, "Prepaid" for Payment_Terms, and "Zimbabwe" for Country. The combination need not match any row in the sample, so it says nothing about toy imports from Zimbabwe. It only confirms which labels sit at the end of each column's alphabetical order.

In [87]:
#Mode
cat_df.mode()

Unnamed: 0,Import_Export,Category,Shipping_Method,Payment_Terms,Country
0,Import,Machinery,Land,Net 60,Congo
1,,,,,Uzbekistan


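The mode lists the most frequent value in each column, again computed independently, so it does not describe one typical transaction. Import is the most common trade direction, Machinery the most common category, Land the most common shipping method, and Net 60 the most common payment term. For Country, Congo and Uzbekistan tie as the most frequent values, which is why the Country column has two rows.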

In [90]:
#Rank
cat_df.rank().head(1)

Unnamed: 0,Import_Export,Category,Shipping_Method,Payment_Terms,Country
7575,487.0,203.5,1013.0,1750.0,312.0


The relationship between 'Country' and 'Import_Export' was examined using a Chi-squared test of independence, which gave a statistic of about 248.08 and a p-value of 0.38. Since the p-value is well above the 0.05 threshold, we cannot reject the null hypothesis, so there is no evidence of an association between the country and whether a transaction is an import or an export. This suggests that the import/export mix is broadly similar across countries, and that country-specific factors may matter less for trade direction than broader market conditions. Consequently, instead of adjusting tactics based on country alone, companies and policymakers may choose to concentrate on broad market trends. Note that with 243 countries in 2,001 rows, many expected cell counts are small (about 4 on average), so the Chi-squared approximation should be treated with caution. Further research on the dynamics of individual countries may still yield useful insights.
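
The cells above do not include the code that produced the Chi-squared statistic. One way such a test could be run is sketched below, assuming `pd.crosstab` to build the contingency table and SciPy's `stats.chi2_contingency` for the test; the helper `chi_square_independence` and the toy data are hypothetical.

```python
import pandas as pd
from scipy import stats

def chi_square_independence(df, row_col, col_col):
    """Chi-squared test of independence between two categorical columns."""
    table = pd.crosstab(df[row_col], df[col_col])
    chi2, p_value, dof, expected = stats.chi2_contingency(table)
    # The smallest expected count indicates how reliable the approximation is.
    return chi2, p_value, dof, expected.min()

# Toy data (hypothetical): every country has the same import/export split.
toy = pd.DataFrame({
    "Country": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "Import_Export": (["Import"] * 5 + ["Export"] * 5) * 3,
})
chi2, p_value, dof, min_expected = chi_square_independence(
    toy, "Country", "Import_Export"
)
print(chi2, p_value, dof, min_expected)
```

On the sample, the call would be `chi_square_independence(cat_df, 'Country', 'Import_Export')`, with (243 - 1) x (2 - 1) = 242 degrees of freedom. Grouping rare countries into regions before testing is one option for raising the expected counts.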