
1. Analysis of descriptive statistics
2. Estimation of the regression model under the assumptions of the classical linear regression model (CLRM)
3. Estimation of the regression model under departures from the CLRM assumptions:
 - Diagnosis of single atypical observations
 - Diagnosis of sample homogeneity
 - Diagnostics of specification errors and model correction when they are detected
 - Diagnosis of multicollinearity and model correction when it is detected
 - Diagnosis of heteroscedasticity and model correction when it is detected
 - Diagnosis of endogeneity and model correction when it is detected


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.diagnostic as dg
import scipy.stats
from scipy.stats import boxcox

Two articles in the literature on this topic proved most useful: "Economic efficiency of beef cattle production in Thailand" by Professor Suneeporn Suwanmaneepong (Faculty of Agricultural Technology, King Mongkut's Institute of Technology Ladkrabang) and "Assessment of technical efficiency and its determinants in beef cattle production in Kenya" by Eric Ruto (Lincoln University). In the first paper, the professor examines the economic efficiency of livestock production. Her model uses the following variables as the most relevant ones: the cost of feed and additives, equipment, drugs and labor, access to priority markets, etc. Unfortunately, our data contain no information on spending on veterinary drugs, so we cannot analyze its effect on the profitability of the enterprise directly. Instead, we subtract every cost item we do observe from the cost price. The remainder is an amount that includes the cost of veterinary drugs.

Moreover, both authors conclude that government intervention is needed, in the form of assistance such as:
- Improving farmers' access to the knowledge they need to develop their farms as well as their farming skills
- Providing access to more modern technologies
- Improving access to market services
- Creating opportunities for off-farm income generation.

All of these factors are related in one way or another to government support, that is, to some type of subsidy. According to the authors, such support should directly improve profit margins, and including it should therefore improve the model's performance.


In [2]:
from google.colab import files
file = files.upload()

Saving agro_census.dta to agro_census.dta


In [3]:
data = pd.read_stata('agro_census.dta')
data.columns, data.shape

(Index(['NPPP', 'COD_COATO', 'KFS', 'KOPF', 'OKVED', 'land_total',
        'cost_milk_KRS', 'cost_KRS_food', 'cost_meat_KRS', 'AB_1', 'CF_1',
        'short_credit', 'long_credit', 'debit_debt', 'credit_debt',
        'gov_sup_plant', 'gov_sup_seed', 'gov_sup_grain', 'subs_plant',
        'subs_grain', 'gov_sup_farming', 'gov_sup_KRS', 'subs_prod_farm',
        'subs_milk', 'subs_meat', 'subs_KRS', 'subs_combikorm', 'sub_chemistry',
        'subs_fuel', 'farms_number', 'profit_farms_number',
        'unprofit_farms_number', 'capital', 'profit', 'unprofit', 'J', 'O',
        'empl_org', 'empl_prod', 'V', 'W', 'X', 'AN', 'AO', 'AP', 'AQ', 'AR',
        'BE', 'BF', 'BG', 'BQ', 'BR', 'BS', 'BT', 'BU', 'BY', 'BZ', 'CA',
        'salary_plant', 'salary_farm', 'DB', 'DC', 'DF', 'DG', 'DH', 'DI', 'DK',
        'DO', 'DT', 'EC', 'EG', 'EJ', 'EK', 'ER', 'ES', '_merge'],
       dtype='object'),
 (6287, 76))

In [5]:
df = data[(data['OKVED'] == '01.21')]
df.shape

(2595, 76)

In [6]:
df['net_profit'] = df['profit'] - df['unprofit']
df['other_cost'] = df['BQ'] - df['salary_farm'] - df['DC'] - df['DI']
df['subsidies'] = df['gov_sup_KRS'] + df['subs_prod_farm'] + df['subs_milk'] + df['subs_KRS'] + df['subs_combikorm'] + df['subs_fuel']
df['debt'] = df['credit_debt'] - df['debit_debt']
df['cost_services'] = df['J'] - df['BQ']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['net_profit'] = df['profit'] - df['unprofit']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['other_cost'] = df['BQ'] - df['salary_farm'] - df['DC'] - df['DI']
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['subsidies'] = df['gov_sup_KRS'] + df['subs_prod_farm'] + df['subs_milk'] + df['sub

In [7]:
df.rename(columns={'DC' : 'amortization',
                   'DI' : 'social_cost'}, inplace=True)
df['output'] = df['AP'] + df['BE'] + df['BS']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.rename(columns={'DC' : 'amortization',
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['output'] = df['AP'] + df['BE'] + df['BS']
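
The repeated warnings above appear because `df` was created by filtering `data`, so pandas cannot tell whether later assignments are meant to modify the original table. A minimal sketch of one common way to avoid them is shown below. It assumes an explicit `.copy()` on the filtered subset is acceptable here; the toy table and the `gross_profit` name are invented for illustration and are not part of the census data.

```python
import pandas as pd

# Toy stand-in for the census table; the values are invented for illustration.
data = pd.DataFrame({
    'OKVED': ['01.21', '01.11', '01.21'],
    'profit': [100.0, 50.0, 0.0],
    'unprofit': [0.0, 0.0, 30.0],
})

# .copy() makes the subset an independent DataFrame, so later column
# assignments are unambiguous and do not raise SettingWithCopyWarning.
df = data[data['OKVED'] == '01.21'].copy()
df['net_profit'] = df['profit'] - df['unprofit']

# rename() without inplace=True returns a new frame instead of
# modifying a possible view in place ('gross_profit' is a hypothetical name).
df = df.rename(columns={'profit': 'gross_profit'})

print(df)
```

The original `data` table is left unchanged by this pattern, which also makes it safe to rerun the filtering cell.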


In [8]:
column_names = ['net_profit', 'other_cost', 'subsidies', 'debt',
                'cost_services', 'amortization', 'output', 'salary_farm',
                'empl_org', 'KOPF', 'social_cost']
livestock = df[column_names]
livestock.shape

(2595, 11)

We ended up with the following variables:
- net_profit: net profit of livestock production
- rentabel: profitability (ratio of net profit to revenue)
- other_cost: remaining costs, which include, among other things, loan repayments and purchases of veterinary drugs
- social_cost: deductions for social needs
- subsidies: total amount of subsidies, including subsidies for milk and meat production, fuel subsidies, etc.
- debt: current short-term debt (accounts payable minus accounts receivable)
- cost_services: cost of services and works sold (the cost of goods, products, works and services sold minus the cost of livestock products sold)
- amortization: amortization
- output: gross output of milk, meat and cattle
- salary_farm: labor costs
- empl_org: average annual number of employees of the agricultural organization
- KOPF: code of the organizational and legal form:
 - 42: Unitary enterprises, based on the right of economic management
 - 47: Open joint stock companies
 - 52: Production cooperatives
 - 65: Limited liability companies
 - 67: Closed joint-stock companies
 - 54: Collective farms
 - 55: State farms
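
The KOPF codes listed above are easier to read in tables and plots once they are mapped to labels. The sketch below shows one way to do that. It assumes the codes are stored as integers, as the preview of the data suggests; the `KOPF_LABELS` dictionary, the `kopf_label` column and the toy frame are illustrative names that do not appear in the original notebook.

```python
import pandas as pd

# KOPF codes and labels, copied from the variable description above.
KOPF_LABELS = {
    42: 'Unitary enterprises, based on the right of economic management',
    47: 'Open joint stock companies',
    52: 'Production cooperatives',
    54: 'Collective farms',
    55: 'State farms',
    65: 'Limited liability companies',
    67: 'Closed joint-stock companies',
}

# Toy frame standing in for the `livestock` table (values are illustrative).
toy = pd.DataFrame({'KOPF': [47, 52, 67, 52]})

# Series.map looks up each code in the dictionary; unknown codes become NaN.
toy['kopf_label'] = toy['KOPF'].map(KOPF_LABELS)

# Count the farms per legal form.
counts = toy['kopf_label'].value_counts()
print(counts)
```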

In [None]:
livestock.head()

Unnamed: 0,net_profit,other_cost,subsidies,debt,cost_services,amortization,output,salary_farm,empl_org,KOPF,social_cost
0,8931.0,30701.0,3892.0,26477.0,48314.0,2114.0,2693.0,8735.0,294,47,1147
1,2495.0,13939.0,3710.0,11271.0,3138.0,582.0,1678.0,3827.0,166,52,490
2,98.0,16900.0,4096.0,18952.0,7885.0,572.0,3043.0,6790.0,235,67,1014
3,-4868.0,6460.0,207.0,10718.0,2083.0,162.0,668.0,1432.0,95,52,182
6,-3457.0,7502.0,825.0,3060.0,8783.0,0.0,332.0,1645.0,136,67,210


In [None]:
# Count missing values (NaN) in each column
livestock.isnull().sum()

net_profit       0
other_cost       0
subsidies        0
debt             0
cost_services    0
amortization     0
output           0
salary_farm      0
empl_org         0
KOPF             0
social_cost      0
dtype: int64

In [None]:
# Count zero values in each column
(livestock == 0).sum()

net_profit        25
other_cost        70
subsidies        249
debt              25
cost_services     78
amortization     154
output           101
salary_farm       82
empl_org           0
KOPF               0
social_cost       90
dtype: int64

In [None]:
# Drop rows in which every column equals zero
livestock = livestock.loc[~(livestock == 0).all(axis=1)]
livestock.shape

(2595, 11)
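
The shape stays at (2595, 11) because the filter removes a row only when all of its columns are zero, and the zero counts above show that `empl_org` and `KOPF` are never zero. The toy sketch below contrasts `.all(axis=1)` with the stricter `.any(axis=1)`. The data are invented for illustration, and the choice between the two is left open here.

```python
import pandas as pd

# Toy frame: row 0 is all zeros, rows 1 and 2 contain a single zero each.
toy = pd.DataFrame({
    'output': [0, 5, 0],
    'subsidies': [0, 0, 3],
})

all_zero = (toy == 0).all(axis=1)  # True only when every column is 0
any_zero = (toy == 0).any(axis=1)  # True when at least one column is 0

kept_all = toy.loc[~all_zero]      # drops only fully empty rows
kept_any = toy.loc[~any_zero]      # drops every row containing a zero

print(len(kept_all), len(kept_any))
```

With `.any`, a single legitimate zero (for example, a farm that received no subsidies) would remove the whole observation, so it is a much more aggressive filter.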

In [None]:
livestock.describe()

Unnamed: 0,net_profit,other_cost,subsidies,debt,cost_services,amortization,output,salary_farm,empl_org,KOPF,social_cost
count,2595.0,2595.0,2595.0,2595.0,2595.0,2595.0,2595.0,2595.0,2595.0,2595.0,2595.0
mean,7841.578343,22817.386898,4313.216817,14084.024663,10753.023815,1647.998073,2848.882312,7750.120617,152.261657,55.442389,1179.604239
std,26857.090812,35745.317554,8416.003095,49316.495885,24358.144435,2661.216427,7633.231854,9493.783825,129.00354,8.535656,1579.778162
min,-117594.0,-26608.0,0.0,-642913.0,-72855.0,0.0,0.0,0.0,1.0,42.0,0.0
25%,134.5,6547.0,510.5,916.0,1391.0,224.0,629.0,2532.0,81.0,52.0,323.0
50%,2889.0,13543.0,1669.0,5085.0,4462.0,742.0,1292.0,5323.0,125.0,52.0,710.0
75%,9903.0,26494.0,4685.0,15042.0,10921.5,1900.0,2368.0,9783.0,189.5,65.0,1476.0
max,732684.0,771320.0,163506.0,842063.0,529553.0,32719.0,120690.0,225988.0,2251.0,67.0,27666.0


In [9]:
(livestock < 0).sum()

net_profit       469
other_cost        15
subsidies          0
debt             380
cost_services     31
amortization       0
output             0
salary_farm        0
empl_org           0
KOPF               0
social_cost        0
dtype: int64

In [12]:
anomaly_cost = livestock[(livestock['other_cost'] < 0)]
anomaly_cost

Unnamed: 0,net_profit,other_cost,subsidies,debt,cost_services,amortization,output,salary_farm,empl_org,KOPF,social_cost
895,-9.0,-556.0,0.0,732.0,11.0,154.0,39.0,679.0,18,52,95
1631,-5449.0,-528.0,0.0,10825.0,802.0,549.0,7.0,1795.0,26,47,185
2794,38.0,-876.0,0.0,-180.0,6734.0,243.0,156.0,2117.0,26,42,24
4303,-20.0,-25.0,0.0,694.0,0.0,17.0,25.0,144.0,10,52,17
4367,-31628.0,-13510.0,0.0,-595802.0,7605.0,3906.0,117.0,19285.0,84,65,3675
4612,1008.0,-14886.0,0.0,15634.0,3554.0,4328.0,0.0,20070.0,91,47,2112
4758,16390.0,-26608.0,392.0,22267.0,90195.0,8404.0,0.0,16138.0,276,52,2066
5106,1115.0,-15747.0,1663.0,68254.0,1636.0,12559.0,262.0,4400.0,91,65,972
5210,336.0,-1.0,161.0,7198.0,2586.0,1.0,0.0,0.0,17,52,0
5212,26.0,-95.0,0.0,273.0,0.0,59.0,9.0,190.0,12,52,29


In [13]:
anomaly_services = livestock[(livestock['cost_services'] < 0)]
anomaly_services

Unnamed: 0,net_profit,other_cost,subsidies,debt,cost_services,amortization,output,salary_farm,empl_org,KOPF,social_cost
882,3246.0,17078.0,224.0,-8037.0,-4977.0,825.0,0.0,9498.0,90,52,2488
1022,-44210.0,127529.0,0.0,17251.0,-72855.0,3627.0,5990.0,14482.0,163,65,3865
1315,2266.0,71720.0,2121.0,37021.0,-11662.0,3045.0,3086.0,17424.0,135,42,3728
1317,3863.0,3017.0,4.0,-2013.0,-687.0,522.0,121.0,4712.0,37,47,573
1320,13795.0,53357.0,5506.0,2632.0,-5282.0,6690.0,2612.0,13606.0,166,52,3062
1743,385.0,27871.0,2765.0,23167.0,-1048.0,2471.0,958.0,8536.0,152,65,1920
1781,578.0,8560.0,998.0,3985.0,-148.0,114.0,808.0,2029.0,104,54,221
2555,6854.0,72272.0,8643.0,6776.0,-12981.0,13323.0,2647.0,10328.0,132,47,1012
2567,25675.0,66288.0,9786.0,-11910.0,-8885.0,2714.0,2360.0,12315.0,128,52,1946
2568,-1147.0,49336.0,4674.0,34129.0,-8244.0,1942.0,1650.0,19253.0,183,52,3042
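
The two checks above can be combined into one boolean mask that flags any row with a negative value in a cost column. The following is a minimal sketch under the assumption that negative `other_cost` or `cost_services` values are treated as anomalies. Whether such rows should be dropped or corrected is not decided here. The three toy rows reuse values shown in the outputs above, and the `cost_cols`, `negative_cost`, `anomalies` and `clean` names are illustrative.

```python
import pandas as pd

# Three rows with values taken from the outputs above (index 0, 895, 882).
toy = pd.DataFrame(
    {
        'net_profit': [8931.0, -9.0, 3246.0],
        'other_cost': [30701.0, -556.0, 17078.0],
        'cost_services': [48314.0, 11.0, -4977.0],
    },
    index=[0, 895, 882],
)

# Columns that are costs by construction and should not be negative.
cost_cols = ['other_cost', 'cost_services']

# True for rows where at least one cost column is below zero.
negative_cost = (toy[cost_cols] < 0).any(axis=1)

anomalies = toy.loc[negative_cost]  # rows to inspect
clean = toy.loc[~negative_cost]     # rows without negative costs

print(anomalies)
```

`net_profit` is deliberately left out of `cost_cols`, since a negative net profit is simply a loss rather than a data anomaly.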
