This report covers key statistical concepts and their application to several datasets. It begins with an overview of fundamental statistical terms, including dependent, independent, and control variables, along with the levels of measurement: nominal, ordinal, interval, and ratio. It then explains the criteria for choosing statistical techniques based on variable types and research questions, and distinguishes between descriptive and inferential statistics. The report highlights the impact of violating statistical assumptions and the importance of selecting appropriate tests, such as the chi-square test for nominal variables, t-tests for continuous data, and ANOVA for multiple groups. It also covers key concepts such as population and sample, measures of central tendency, variance and standard deviation, confidence intervals, and z-scores.

The time series data analysis segment applies visual inspection, autocorrelation functions, and statistical tests (KPSS and ADF) to evaluate the constancy of means over time for wind speed and Microsoft stock trading volume. The findings reveal that while wind speed showed a constant mean, the stock volume results were mixed, emphasizing the need for multiple testing methods. The final section compares ARIMA models for US unemployment rate data, analyzing different model fits and autocorrelation. The analysis illustrates the effectiveness of ARIMA models in capturing time series patterns, with varying degrees of model performance and fit statistics across different configurations.

# The Power of Statistics, Levels of Measurement, and Statistical Techniques

This report summarizes key concepts in statistics, including variable types, levels of measurement, and the application of statistical techniques. The content reflects the work I completed as part of my practicum in the UTSA Master's program in Data Analytics. Several datasets were used, and I conducted the analysis in Gretl.

**Dependent Variable**: A variable that depends on the changes made to the independent variable.

**Independent Variable**: The variable that you manipulate or control to observe its effect on the dependent variable.

**Control Variable**: A variable that may affect the dependent variable but is not of primary interest in the study.

**Levels of Measurement**

1.  Nominal: Categorical data with distinct groups, without any order or ranking.

-   Example: Eye color (brown, black, blue).

2.  Ordinal: Categorical data that can be ordered or ranked.

-   Example: Education levels (high school diploma, bachelor's, master's, Ph.D.).

3.  Interval: Data with equal intervals between values, but no true zero.

-   Example: Temperature in Celsius.

4.  Ratio: Data with equal intervals and an absolute zero point.

-   Example: Height.

**Criteria for Choosing Statistical Techniques**

According to the [**IDRE Chart**](https://stats.oarc.ucla.edu/other/mult-pkg/whatstat/), two important criteria for selecting a statistical technique are:

-   The levels of measurement of both the dependent and independent variables.
-   The number of dependent and independent variables.

Additionally, the nature of the research question plays a crucial role in determining the appropriate technique.

**Descriptive vs. Inferential Statistics**

-   Descriptive Statistics: Summarize and present the main features of the data.
-   Inferential Statistics: Use sample data to draw conclusions and make predictions about the wider population.

**Assumptions in Statistical Tests**

-   Benefit: Proper assumptions ensure reliable and trustworthy results.
-   Cost: Violating assumptions can compromise the validity and credibility of the test.

**Impact of Violating Assumptions**

Violating the assumptions of a statistical test can undermine the integrity of the analysis, leading to questionable results and a loss of credibility.

**Selecting Appropriate Statistical Tests**

1.  Colored Contact Lenses and Gender:

-   Variables: Gender (independent), color of lenses (dependent).
-   Test: Chi-square test (both variables are nominal).

2.  Art Auction Prices by Gender:

-   Variables: Gender (independent), price paid for art (dependent).
-   Test: Independent-samples t-test (price is continuous and gender defines two groups).

3.  Vegematic Sales by Product and Demographics:

-   Variables: Product color, price, region, gender, household income (independent), number of products sold (dependent).
-   Test: ANOVA.

4.  Magazine Pages and Sales:

-   Variables: Number of pages (independent), copies sold (dependent).
-   Test: Pearson correlation.

5.  Hair Growth in Cats by Drug, Gender, and Coat Color:

-   Variables: Drug, cat gender, coat color (independent), hair growth (dependent).
-   Test: Three-way ANOVA.
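To make the first scenario concrete, here is a minimal sketch of how a chi-square statistic for two nominal variables can be computed by hand. The counts are invented purely for illustration, and `chi_square_statistic` is a hypothetical helper name rather than a library function.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = gender, columns = lens color (blue, green, brown)
lens_counts = [
    [30, 20, 10],
    [15, 20, 25],
]
stat, dof = chi_square_statistic(lens_counts)
print(f"chi-square = {stat:.4f}, degrees of freedom = {dof}")
```

With two rows and three columns there are 2 degrees of freedom. The 5% critical value of the chi-square distribution with 2 degrees of freedom is about 5.991, so these made-up counts would lead to rejecting independence.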

Other Key Statistical Concepts:

**Population and Sample**

-   Population: The entire set or group being studied.
-   Sample: A subset of the population used for the study.
-   Inferential Statistics: The goal is to make predictions about the population based on the sample.

**Measures of Central Tendency**

-   Mean: The average of all data points.

-   Median: The middle value when data is arranged in order.

-   Example: In an exam, the mean tells you the overall class performance, while the median can indicate the typical student's performance.
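The difference between the two measures is easiest to see when one value is extreme. The sketch below uses a hypothetical set of exam scores (not taken from any dataset in this report) in which a single low score pulls the mean below the median.

```python
from statistics import mean, median

# Hypothetical exam scores; the single score of 20 acts as an outlier.
scores = [20, 70, 72, 75, 78, 80, 85]

class_mean = mean(scores)      # sensitive to the outlier
class_median = median(scores)  # middle value of the sorted scores

print(f"mean = {class_mean:.2f}, median = {class_median}")
```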

**Variance and Standard Deviation**

-   Variance: The average squared deviation of the data points from the mean.

-   Standard Deviation: The square root of the variance, indicating the spread of data around the mean.

When data points are tightly packed around the mean, variance and standard deviation are lower.
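As a small illustration, the following sketch compares two hypothetical datasets with the same mean but different spread. It uses the population formulas (`pvariance`, `pstdev`) from Python's standard library; the sample versions (`variance`, `stdev`) divide by n - 1 instead and would give slightly larger values.

```python
from statistics import mean, pstdev, pvariance

# Two hypothetical datasets, both with a mean of 10
tight = [9, 10, 10, 10, 11]
spread = [2, 6, 10, 14, 18]

for name, data in (("tight", tight), ("spread", spread)):
    print(
        f"{name}: mean = {mean(data)}, "
        f"variance = {pvariance(data):.2f}, "
        f"std dev = {pstdev(data):.3f}"
    )
```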

**Confidence Intervals**

-   Confidence Interval: A range, computed from the sample, that would contain the population parameter in a stated percentage (for example, 95%) of repeated samples.
-   For example, in a survey of 150 voters where 45% support a candidate, a 95% confidence interval around this proportion ranges from \~0.3704 to 0.5296.
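The interval in the voter example is consistent with the usual normal-approximation formula for a proportion, p ± z·sqrt(p(1 − p)/n), with z = 1.96 for 95% confidence. A minimal sketch, where `proportion_ci` is a hypothetical helper name:

```python
import math


def proportion_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


# Survey from the example: 150 voters, 45% support the candidate
lower, upper = proportion_ci(0.45, 150)
print(f"95% CI: {lower:.4f} to {upper:.4f}")
```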

**Z-Scores**

-   Z-Score: Measures how many standard deviations a data point is from the mean.
-   Example: For UTSA student ages with a mean of 26 and a standard deviation of 4, the Z-score for a student aged 24 is calculated as:

$$
Z = \frac{24 - 26}{4} = -0.5
$$

# Time Series Data Analysis Report

**Introduction**: The goal of this section is to analyze time series datasets by investigating whether their mean remains constant over time. This analysis involves importing time series data into Gretl, visualizing the data, and applying statistical tests to determine whether the time series exhibits a constant mean. The methods used include visual inspection through plotting and formal statistical testing via the KPSS and Augmented Dickey-Fuller (ADF) tests.

Dataset 1: [**Wind Speed in Delhi**](https://www.kaggle.com/datasets/sumanthvrao/daily-climate-time-series-data)

The first dataset selected for analysis records wind speed (measured in km/h) in Delhi, India, from January 1, 2013, to April 24, 2017.

To begin the analysis, the wind speed data was plotted over time. By visually inspecting the plot, it appeared that the mean wind speed remained relatively constant throughout the time period. However, a subtle trend was observable in certain segments of the data, which warranted further investigation.

![](marget-segmentation/Plot1.png){width="600"}

An autocorrelation function (ACF) plot was generated to assess the presence of autocorrelation and trends. The ACF plot exhibited a slight decrease in correlation as the lag increased, indicating autocorrelation in the dataset. This trend suggested that the mean might not be perfectly constant across the entire time series.

![](marget-segmentation/Picture2.png)
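For readers who want to see what an ACF plot is built from, here is a minimal sketch of the sample autocorrelation function. It follows the textbook definition rather than Gretl's internal code, `sample_acf` is a hypothetical helper name, and the short trending series is made up for illustration.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)

    acf = []
    for lag in range(max_lag + 1):
        # Cross-products of deviations that are `lag` periods apart
        num = sum(
            (series[t] - mean) * (series[t - lag] - mean)
            for t in range(lag, n)
        )
        acf.append(num / denom)
    return acf


# Hypothetical series with a steady upward trend
trending = [1, 2, 3, 4, 5, 6, 7, 8]
acf = sample_acf(trending, 3)
print([round(r, 4) for r in acf])
```

In this toy series the autocorrelation is 1 at lag 0 and falls as the lag increases, which is the same qualitative pattern described for the wind speed data.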

To formally test for a constant mean, two statistical tests were applied to the dataset:

**KPSS test for wind_speed**

**T = 1462**

**Lag truncation parameter = 7**

**Test statistic = 0.216261**

**10%      5%      1%**

**Critical values: 0.348   0.462   0.743**

**P-value \> .10**

-   KPSS Test: The null hypothesis (H0) of the KPSS test is that the series is stationary around a constant level, that is, the mean is constant over time. The test statistic for the wind speed dataset was 0.216261, below even the 10% critical value of 0.348 (p-value greater than 0.10), so we failed to reject the null hypothesis. Thus, the KPSS test suggested that the mean wind speed is constant over time.

**Augmented Dickey-Fuller test for wind_speed**

**testing down from 23 lags, criterion AIC**

**sample size 1438**

**unit-root null hypothesis: a = 1**

**test with constant**

**including 23 lags of (1-L)wind_speed**

**model: (1-L)y = b0 + (a-1)\*y(-1) + \... + e**

**estimated value of (a - 1): -0.211152**

**test statistic: tau_c(1) = -4.11449**

**asymptotic p-value 0.0009163**

**1st-order autocorrelation coeff. for e: -0.003**

**lagged differences: F(23, 1413) = 3.597 \[0.0000\]**

-   Augmented Dickey-Fuller (ADF) Test: The ADF test's null hypothesis (H0) is that the series has a unit root, implying a non-constant mean. The test statistic for the wind speed data was -4.11449, with a p-value of 0.0009163. Because the p-value is well below 0.05, the null hypothesis was rejected, indicating that the mean wind speed was likely constant over time.

Dataset 2: [**Microsoft Stock- Time Series Analysis**](https://www.kaggle.com/datasets/vijayvvenkitesh/microsoft-stock-time-series-analysis)

The stock trading volume for Microsoft from April 1, 2015, to April 1, 2021, was analyzed to examine the behavior of stock volume over time.

![](marget-segmentation/Picture3.png)

The plot of Microsoft's stock trading volume over time suggested that the mean volume was relatively stable throughout the observed period. No significant upward or downward trends were apparent to the naked eye.

![](marget-segmentation/Picture4.png){width="667"}

-   ACF Plot Analysis: An ACF plot of the stock volume data showed a gradual decline in autocorrelation as the lag increased. This indicated autocorrelation in the dataset, but the presence of a trend or non-constant mean was less clear from the plot alone.

Statistical testing: To investigate further, the same two statistical tests were applied:

**KPSS test for Volume**

**T = 1511**

**Lag truncation parameter = 7**

**Test statistic = 0.646783**

**10%      5%      1%**

**Critical values: 0.348   0.462   0.743**

**Interpolated p-value 0.024**

-   KPSS Test: For the stock volume data, the KPSS test statistic was 0.646783, with a p-value of 0.024. Since the p-value was less than the significance level of 0.05, we rejected the null hypothesis, concluding that the mean stock volume was not constant over time.
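The decision rule behind both KPSS results can be sketched as a comparison between the reported test statistic and the critical values printed by Gretl. The helper name `kpss_rejections` is an assumption made for this illustration; the numbers are the ones reported in the two KPSS outputs.

```python
# Critical values reported in the Gretl KPSS output (significance level -> value)
KPSS_CRITICAL = {0.10: 0.348, 0.05: 0.462, 0.01: 0.743}


def kpss_rejections(stat, critical=KPSS_CRITICAL):
    """Significance levels at which the constant-mean null is rejected."""
    # KPSS rejects when the statistic exceeds the critical value
    return [
        level
        for level, value in sorted(critical.items(), reverse=True)
        if stat > value
    ]


wind_levels = kpss_rejections(0.216261)    # wind_speed statistic
volume_levels = kpss_rejections(0.646783)  # Volume statistic

print("wind_speed rejected at:", wind_levels)
print("Volume rejected at:", volume_levels)
```

The wind speed statistic is below every critical value, so the null is never rejected. The volume statistic exceeds the 10% and 5% critical values but not the 1% value, which matches the interpolated p-value of 0.024.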

**Augmented Dickey-Fuller test for Volume**

**testing down from 23 lags, criterion AIC**

**sample size 1502**

**unit-root null hypothesis: a = 1**

**test with constant**

**including 8 lags of (1-L)Volume**

**model: (1-L)y = b0 + (a-1)\*y(-1) + \... + e**

**estimated value of (a - 1): -0.185615**

**test statistic: tau_c(1) = -6.89966**

**asymptotic p-value 6.158e-10**

**1st-order autocorrelation coeff. for e: -0.001**

**lagged differences: F(8, 1492) = 13.263 \[0.0000\]**

-   Augmented Dickey-Fuller (ADF) Test: The ADF test for stock volume yielded a test statistic of -6.89966 with a p-value of 6.158e-10. As the p-value was below the 0.05 significance level, the null hypothesis of a unit root was rejected, indicating that the mean trading volume was constant over time.

In conclusion, this analysis applied visual inspection, ACF plots, and statistical tests to assess whether the mean of two time series datasets remained constant over time. The results varied across datasets. For the wind speed data, both visual inspection and statistical testing suggested a constant mean over time. In the case of Microsoft stock volume, the KPSS test indicated a non-constant mean, while the ADF test suggested otherwise, pointing to the importance of using multiple tests in time series analysis. This report demonstrates how combining visual methods with formal statistical testing can provide a robust understanding of the characteristics of time series data.
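Because the two tests have opposite null hypotheses, their results are easiest to read side by side. The sketch below is one possible way to combine them. The function name and the labels it returns are assumptions made for this illustration, and the wind speed KPSS p-value, reported only as greater than 0.10, is entered as 0.10 as a lower bound.

```python
def combine_unit_root_tests(adf_p, kpss_p, alpha=0.05):
    """Combine ADF (null: unit root) and KPSS (null: constant mean) results."""
    adf_rejects_unit_root = adf_p < alpha
    kpss_rejects_constant_mean = kpss_p < alpha

    if adf_rejects_unit_root and not kpss_rejects_constant_mean:
        return "constant mean (tests agree)"
    if not adf_rejects_unit_root and kpss_rejects_constant_mean:
        return "non-constant mean (tests agree)"
    if adf_rejects_unit_root and kpss_rejects_constant_mean:
        return "mixed (tests disagree)"
    return "inconclusive (neither null rejected)"


# p-values reported in the Gretl output
wind_result = combine_unit_root_tests(adf_p=0.0009163, kpss_p=0.10)
volume_result = combine_unit_root_tests(adf_p=6.158e-10, kpss_p=0.024)

print("wind_speed:", wind_result)
print("Volume:", volume_result)
```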

# ARIMA Model Comparison

This report covers the analysis of the unemployment rate in the US from January 1948 to March 2020. The analysis includes time series plotting, differencing to achieve stationarity, and fitting various ARIMA models to the differenced data. The models are compared using several metrics including AIC, BIC, and the Ljung-Box test.

*Description:* The plot below shows the unemployment rate over time from January 1948 to March 2020.

![](Picture5.png)

Because the unit root tests suggest a non-constant mean for the original series, the series was first-differenced. The plot of the first-differenced data is shown below:

![](Picture26.png)
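First differencing replaces each observation with its change from the previous period. A minimal sketch of the operation follows. The rates are hypothetical and `first_difference` is an illustrative helper, not the Gretl command used for the analysis.

```python
def first_difference(series):
    """Return the period-to-period changes of a series."""
    return [current - previous for previous, current in zip(series, series[1:])]


# Hypothetical monthly unemployment rates (percent)
rates = [3.4, 3.8, 4.0, 3.9, 3.5]
d_rates = first_difference(rates)

print([round(change, 2) for change in d_rates])
print(f"{len(rates)} observations -> {len(d_rates)} differences")
```

Differencing always costs one observation, which is why the models below use observations starting in 1948:02 rather than 1948:01.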

ARIMA Model Analysis:

**Model 1: ARMA, using observations 1948:02-2020:03 (T = 866)**\
**Dependent variable:** d_UNRATE\
**Standard errors based on Hessian**

| Coefficient | Estimate  | Std. Error | z-value  | p-value  | Significance |
|-------------|-----------|------------|----------|----------|--------------|
| const       | 0.002873  | 0.014791   | 0.1942   | 0.8460   |              |
| phi_1       | 0.870665  | 0.029667   | 29.3500  | \<0.0001 | \*\*\*       |
| theta_1     | -0.718031 | 0.037947   | -18.9200 | \<0.0001 | \*\*\*       |

**Model Fit Statistics:**

| Statistic           | Value     |
|---------------------|-----------|
| Mean dependent var  | 0.001155  |
| S.D. dependent var  | 0.209924  |
| Mean of innovations | -0.000378 |
| S.D. of innovations | 0.200521  |
| R-squared           | 0.086522  |
| Adjusted R-squared  | 0.085465  |
| Log-likelihood      | 162.6270  |
| Akaike criterion    | -317.2540 |
| Schwarz criterion   | -298.1985 |
| Hannan-Quinn        | -309.9612 |

**Roots of AR and MA Polynomials:**

| Type | Root   | Real   | Imaginary | Modulus | Frequency |
|------|--------|--------|-----------|---------|-----------|
| AR   | Root 1 | 1.1485 | 0.0000    | 1.1485  | 0.0000    |
| MA   | Root 1 | 1.3927 | 0.0000    | 1.3927  | 0.0000    |

**Autocorrelation Test:**

Test for autocorrelation up to order 12:

-   **Ljung-Box Q'** = 75.3636
-   **p-value** = P(Chi-square(10) \> 75.3636) = 4.042e-012

Model 2

**Model 2: ARMA, using observations 1948:02-2020:03 (T = 866)**\
**Dependent variable:** d_UNRATE\
**Standard errors based on Hessian**

| Coefficient | Estimate  | Std. Error | z-value | p-value  | Significance |
|-------------|-----------|------------|---------|----------|--------------|
| const       | 0.002986  | 0.014898   | 0.2004  | 0.8412   |              |
| phi_1       | 0.555245  | 0.062518   | 8.881   | \<0.0001 | \*\*\*       |
| phi_2       | 0.238727  | 0.037380   | 6.386   | \<0.0001 | \*\*\*       |
| theta_1     | -0.538385 | 0.058356   | -9.226  | \<0.0001 | \*\*\*       |

**Model Fit Statistics:**

| Statistic           | Value     |
|---------------------|-----------|
| Mean dependent var  | 0.001155  |
| S.D. dependent var  | 0.209924  |
| Mean of innovations | -0.000420 |
| S.D. of innovations | 0.196462  |
| R-squared           | 0.123133  |
| Adjusted R-squared  | 0.121101  |
| Log-likelihood      | 180.2785  |
| Akaike criterion    | -350.5570 |
| Schwarz criterion   | -326.7375 |
| Hannan-Quinn        | -341.4410 |

**Roots of AR and MA Polynomials:**

| Type | Root   | Real    | Imaginary | Modulus | Frequency |
|------|--------|---------|-----------|---------|-----------|
| AR   | Root 1 | 1.1911  | 0.0000    | 1.1911  | 0.0000    |
| AR   | Root 2 | -3.5169 | 0.0000    | 3.5169  | 0.5000    |
| MA   | Root 1 | 1.8574  | 0.0000    | 1.8574  | 0.0000    |

**Autocorrelation Test:**

Test for autocorrelation up to order 12:

-   **Ljung-Box Q'** = 36.8101
-   **p-value** = P(Chi-square(9) \> 36.8101) = 2.845e-005

Model 3

**Model 3: ARMA, using observations 1948:02-2020:03 (T = 866)**\
**Dependent variable:** d_UNRATE\
**Standard errors based on Hessian**

| Coefficient | Estimate  | Std. Error | z-value  | p-value  | Significance |
|-------------|-----------|------------|----------|----------|--------------|
| const       | 0.002579  | 0.011520   | 0.2239   | 0.8228   |              |
| phi_1       | 1.655610  | 0.037484   | 44.1700  | \<0.0001 | \*\*\*       |
| phi_2       | -0.782771 | 0.043359   | -18.0500 | \<0.0001 | \*\*\*       |
| theta_1     | -1.641770 | 0.038375   | -42.7800 | \<0.0001 | \*\*\*       |
| theta_2     | 0.863215  | 0.047917   | 18.0100  | \<0.0001 | \*\*\*       |

**Model Fit Statistics:**

| Statistic           | Value     |
|---------------------|-----------|
| Mean dependent var  | 0.001155  |
| S.D. dependent var  | 0.209924  |
| Mean of innovations | -0.000443 |
| S.D. of innovations | 0.194870  |
| R-squared           | 0.137289  |
| Adjusted R-squared  | 0.134286  |
| Log-likelihood      | 187.0535  |
| Akaike criterion    | -362.1069 |
| Schwarz criterion   | -333.5236 |
| Hannan-Quinn        | -351.1678 |

**Roots of AR and MA Polynomials:**

| Type | Root   | Real   | Imaginary | Modulus | Frequency |
|------|--------|--------|-----------|---------|-----------|
| AR   | Root 1 | 1.0575 | -0.3989   | 1.1303  | -0.0574   |
| AR   | Root 2 | 1.0575 | 0.3989    | 1.1303  | 0.0574    |
| MA   | Root 1 | 0.9510 | -0.5041   | 1.0763  | -0.0776   |
| MA   | Root 2 | 0.9510 | 0.5041    | 1.0763  | 0.0776    |

**Autocorrelation Test:**

Test for autocorrelation up to order 12:

-   **Ljung-Box Q'** = 39.2977
-   **p-value** = P(Chi-square(8) \> 39.2977) = 4.328e-006

Model 4

**Model 4: ARMA, using observations 1948:02-2020:03 (T = 866)**\
**Dependent variable:** d_UNRATE\
**Standard errors based on Hessian**

| Coefficient | Estimate  | Std. Error | z-value  | p-value  | Significance |
|-------------|-----------|------------|----------|----------|--------------|
| const       | 0.002507  | 0.011390   | 0.2201   | 0.8258   |              |
| phi_1       | 0.578072  | 0.062491   | 9.250    | \<0.0001 | \*\*\*       |
| phi_2       | 0.117027  | 0.073948   | 1.583    | 0.1135   |              |
| phi_3       | 0.611279  | 0.108845   | 5.616    | \<0.0001 | \*\*\*       |
| phi_4       | -0.695650 | 0.055781   | -12.4700 | \<0.0001 | \*\*\*       |
| theta_1     | -0.585967 | 0.067105   | -8.732   | \<0.0001 | \*\*\*       |
| theta_2     | 0.063179  | 0.074000   | 0.8538   | 0.3932   |              |
| theta_3     | -0.595233 | 0.107839   | -5.520   | \<0.0001 | \*\*\*       |
| theta_4     | 0.766918  | 0.069361   | 11.060   | \<0.0001 | \*\*\*       |
| theta_5     | 0.030504  | 0.070963   | 0.4299   | 0.6673   |              |

**Model Fit Statistics:**

| Statistic           | Value     |
|---------------------|-----------|
| Mean dependent var  | 0.001155  |
| S.D. dependent var  | 0.209924  |
| Mean of innovations | -0.000422 |
| S.D. of innovations | 0.192210  |
| R-squared           | 0.160680  |
| Adjusted R-squared  | 0.152845  |
| Log-likelihood      | 198.7941  |
| Akaike criterion    | -375.5881 |
| Schwarz criterion   | -323.1854 |
| Hannan-Quinn        | -355.5330 |

**Roots of AR and MA Polynomials:**

| Type | Root   | Real    | Imaginary | Modulus | Frequency |
|------|--------|---------|-----------|---------|-----------|
| AR   | Root 1 | 1.0508  | 0.4052    | 1.1262  | 0.0586    |
| AR   | Root 2 | 1.0508  | -0.4052   | 1.1262  | -0.0586   |
| AR   | Root 3 | -0.6114 | -0.8715   | 1.0646  | -0.3474   |
| AR   | Root 4 | -0.6114 | 0.8715    | 1.0646  | 0.3474    |
| MA   | Root 1 | 0.9450  | 0.5028    | 1.0704  | 0.0778    |
| MA   | Root 2 | 0.9450  | -0.5028   | 1.0704  | -0.0778   |

**Autocorrelation Test:**

Test for autocorrelation up to order 12:

-   **Ljung-Box Q'** = 41.5484
-   **p-value** = P(Chi-square(10) \> 41.5484) = 1.741e-007

ARIMA Model Analysis Report

1.  Metrics Overview: AIC, BIC, and Hannan-Quinn

The Akaike Information Criterion (AIC), the Schwarz Criterion (Bayesian Information Criterion, BIC), and the Hannan-Quinn Information Criterion (HQIC) are metrics used to evaluate the goodness of fit of statistical models.

-   **Akaike Information Criterion (AIC):** AIC assesses the relative fit of a model while penalizing for the number of parameters. It is useful for comparing models where smaller values indicate a better fit.

-   **Bayesian Information Criterion (BIC):** BIC, also known as the Schwarz criterion, is similar to AIC but applies a more stringent penalty for additional parameters. It is more conservative, favoring simpler models.

-   **Hannan-Quinn Information Criterion (HQIC):** HQIC is another relative fit metric that falls between AIC and BIC in terms of penalizing complexity. It also provides a measure of model fit with a focus on balancing fit and simplicity.

Both BIC and HQIC are more conservative than AIC for a sample of this size (T = 866) because they impose a higher penalty for each additional parameter. For all three criteria, smaller values indicate a better balance between fit and complexity.
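The three criteria differ only in how heavily they penalize each parameter. The sketch below applies the standard formulas to the Model 1 log-likelihood. The parameter count k = 4 (constant, phi_1, theta_1, and the innovation variance) is an assumption, chosen because it reproduces the values Gretl reported for Model 1.

```python
import math


def information_criteria(loglik, k, n):
    """AIC, BIC (Schwarz), and Hannan-Quinn from a log-likelihood."""
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + k * math.log(n)
    hqc = -2 * loglik + 2 * k * math.log(math.log(n))
    return aic, bic, hqc


# Model 1: log-likelihood 162.6270, T = 866, assumed k = 4
aic, bic, hqc = information_criteria(162.6270, k=4, n=866)
print(f"AIC = {aic:.4f}, BIC = {bic:.4f}, HQC = {hqc:.4f}")

# Per-parameter penalties at n = 866: AIC 2, HQC about 3.82, BIC about 6.76
print(2, round(2 * math.log(math.log(866)), 2), round(math.log(866), 2))
```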

2.  Most Conservative Metrics

Among the metrics discussed, **BIC** is the most conservative in penalizing additional parameters, followed by **Hannan-Quinn**. Both impose a greater penalty for model complexity than AIC, making them more stringent criteria for evaluating model fit. As a result, these metrics are typically used when a more parsimonious model is preferred.

3.  Ljung-Box Test Overview

The **Ljung-Box Q statistic** tests for the presence of autocorrelation in the residuals of a model.

-   **Null Hypothesis (H0):** There is no serial autocorrelation in the residuals.

-   **Alternative Hypothesis (Ha):** There is serial autocorrelation in the residuals.

A high p-value (greater than 0.05) suggests that the residuals are not significantly autocorrelated, indicating a good fit. Conversely, a low p-value indicates that residuals may still exhibit autocorrelation, suggesting that the model may not have fully captured the underlying data structure.
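To show where the Q statistic and its p-value come from, here is a minimal sketch of the textbook Ljung-Box formula, Q = n(n + 2) Σ r_k² / (n − k). It also includes a closed-form chi-square tail probability that is valid only for even degrees of freedom. Both function names are hypothetical, the residuals are made up, and the code is not Gretl's implementation.

```python
import math


def ljung_box_q(residuals, max_lag):
    """Ljung-Box Q statistic for lags 1..max_lag."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)

    total = 0.0
    for k in range(1, max_lag + 1):
        num = sum(
            (residuals[t] - mean) * (residuals[t - k] - mean)
            for t in range(k, n)
        )
        r_k = num / denom  # lag-k autocorrelation of the residuals
        total += r_k ** 2 / (n - k)
    return n * (n + 2) * total


def chi2_sf_even_df(x, df):
    """P(Chi-square(df) > x); closed form that holds for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2
    term = 1.0
    series_sum = 1.0
    for i in range(1, df // 2):
        term *= half / i
        series_sum += term
    return math.exp(-half) * series_sum


# Hypothetical residuals that alternate in sign (strong lag-1 autocorrelation)
q_toy = ljung_box_q([1, -1, 1, -1, 1, -1], max_lag=1)

# p-value for the Model 1 statistic reported above: Q' = 75.3636 with 10 df
p_model1 = chi2_sf_even_df(75.3636, 10)
print(f"toy Q = {q_toy:.4f}, Model 1 p-value = {p_model1:.3e}")
```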

4.  ARIMA Model Comparison Table

Below is a summary table of the ARIMA models evaluated, including their adjusted R-squared, AIC, BIC, and Ljung-Box Q values:

| Model | Adjusted R-squared | AIC      | BIC      | Ljung-Box Q |
|-------|--------------------|----------|----------|-------------|
| 1     | 0.085465           | -317.254 | -298.199 | 75.3636     |
| 2     | 0.121101           | -350.557 | -326.738 | 36.8101     |
| 3     | 0.134286           | -362.107 | -333.524 | 39.2977     |
| 4     | 0.152845           | -375.588 | -323.185 | 17.9674     |

**Best Model:** Model 4 is identified as the best among the four. It has the highest adjusted R-squared value (0.152845) and the lowest AIC (-375.588), but only the third-lowest BIC (-323.185); Model 3 has the lowest BIC (-333.524). Although the Ljung-Box Q statistic indicates some remaining autocorrelation in the residuals, Model 4 performs best overall in terms of fit statistics.
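The ranking can be reproduced from the comparison table with a few lines of code. This is only a sketch of the selection logic; the dictionary layout and variable names are assumptions, and the values are copied from the table.

```python
# Adjusted R-squared, AIC, and BIC from the comparison table
models = {
    1: {"adj_r2": 0.085465, "aic": -317.254, "bic": -298.199},
    2: {"adj_r2": 0.121101, "aic": -350.557, "bic": -326.738},
    3: {"adj_r2": 0.134286, "aic": -362.107, "bic": -333.524},
    4: {"adj_r2": 0.152845, "aic": -375.588, "bic": -323.185},
}

# Lower is better for AIC and BIC; higher is better for adjusted R-squared
best_by_aic = min(models, key=lambda m: models[m]["aic"])
best_by_bic = min(models, key=lambda m: models[m]["bic"])
best_by_adj_r2 = max(models, key=lambda m: models[m]["adj_r2"])
bic_ranking = sorted(models, key=lambda m: models[m]["bic"])

print("best by AIC:", best_by_aic)
print("best by BIC:", best_by_bic)
print("best by adjusted R-squared:", best_by_adj_r2)
print("BIC ranking (best first):", bic_ranking)
```

AIC and adjusted R-squared favor Model 4, while the stricter BIC favors the smaller Model 3. This is the expected pattern when a larger model improves fit only modestly.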

5.  Residual Variance and Model Exploration

Examining the Ljung-Box test results suggests that there is still some autocorrelation present in the residuals of the models. The p-values for the Ljung-Box test are below 0.05, indicating that residuals are not completely free of autocorrelation. This suggests that there may be additional variance that could be captured by exploring further ARIMA models or adjusting model parameters to improve fit and reduce residual autocorrelation.

# SAS Code: Factor Analysis

```{sas}
LIBNAME mylib "P:\";
FILENAME bigrec "P:\fa15_data.txt" LRECL = 65576;

DATA mytemp;
    INFILE bigrec;
    INPUT 
        myid 1-7
        purchase_online_safe_aglo 5526 
        purchase_online_safe_agli 5564
        purchase_online_safe_neit 5640
        purchase_online_safe_dgli 5678
        purchase_online_safe_dglo 5716
        buy_online_aglo 5518 
        buy_online_agli 5556
        buy_online_neit 5632
        buy_online_dgli 5670
        buy_online_dglo 5708
        use_devices_for_deal_aglo 5508
        use_devices_for_deal_agli 5546
        use_devices_for_deal_neit 5622
        use_devices_for_deal_dgli 5660
        use_devices_for_deal_dglo 5698
        hear_products_email_aglo 5525 
        hear_products_email_agli 5563
        hear_products_email_neit 5639
        hear_products_email_dgli 5677
        hear_products_email_dglo 5715
        internet_chnge_shop_aglo 5495 
        internet_chnge_shop_agli 5533
        internet_chnge_shop_neit 5609
        internet_chnge_shop_dgli 5647
        internet_chnge_shop_dglo 5685
        environ_friendly_aglo 4181 
        environ_friendly_agli 4195
        environ_friendly_neit 4223
        environ_friendly_dgli 4237
        environ_friendly_dglo 4251
        recycle_prods_aglo 4190 
        recycle_prods_agli 4204
        recycle_prods_neit 4232
        recycle_prods_dgli 4246
        recycle_prods_dglo 4260
        environ_good_business_aglo 4182
        environ_good_business_agli 4196
        environ_good_business_neit 4224
        environ_good_business_dgli 4238
        environ_good_business_dglo 4252
        environ_personal_ob_aglo 4184
        environ_personal_ob_agli 4198
        environ_personal_ob_neit 4226
        environ_personal_ob_dgli 4240
        environ_personal_ob_dglo 4254
        comp_help_cons_env_aglo 4183 
        comp_help_cons_env_agli 4197
        comp_help_cons_env_neit 4225
        comp_help_cons_env_dgli 4239
        comp_help_cons_env_dglo 4253
        adidas_brand 42607;
RUN;

/* Define value labels for the five-point scale and yes/no variables */
PROC FORMAT;
    VALUE myscale
        1 = "disagree a lot"
        2 = "disagree a little"
        3 = "neither agree nor disagree"
        4 = "agree a little"
        5 = "agree a lot";
    VALUE yesno
        0 = "no"
        1 = "yes";
RUN;

DATA myvars;
    SET mytemp;
    /* Conversion to five-point scale variables */
    /* ... (conversion code) ... */
    FORMAT purchase_online_safe buy_online use_devices_for_deal hear_products_email internet_chnge_shop environ_friendly recycle_prods environ_good_business environ_personal_ob comp_help_cons_env myscale.
           adidas yesno.;
RUN;

/* Factor Analysis */
PROC FACTOR DATA = myvars 
    MAXITER=100
    METHOD=principal
    MINEIGEN=1
    ROTATE=varimax
    MSA
    SCREE
    SCORE
    PRINT
    NFACTORS=2
    OUT=myscores;
    VAR purchase_online_safe 
        buy_online 
        use_devices_for_deal 
        hear_products_email 
        internet_chnge_shop 
        environ_friendly 
        recycle_prods 
        environ_good_business 
        environ_personal_ob 
        comp_help_cons_env;
RUN;

DATA myscores1;
    SET myscores;
    RENAME factor1 = onlineshopper;
    RENAME factor2 = environconscious;
    RENAME myid = resp_id;  /* the ID variable is read in as myid in the mytemp step */
RUN;
